| CD information is listed by chapter and section number followed by page ranges (3.10:6–9). Page references preceded by a single letter refer to appendixes. | Addition, 224–29 | unmapped, 514 |
|-------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|------------------------------------------------------|
|                                                                                                 | binary, 224–25                                                     | virtual, 510                                         |
|                                                                                                 | floating-point, 250–54, 259,                                       | Address translation                                  |
|                                                                                                 | B-73–74                                                            | AMD Opteron X4, 540                                  |
|                                                                                                 | instructions, B-51                                                 | defined, 493                                         |
|                                                                                                 | operands, 225                                                      | fast, 502–4                                          |
|                                                                                                 | significands, 250                                                  | Intel Nehalem, 540                                   |
| 1-bit ALU, C-26–29                                                                              | speed, 229                                                         | TLB for, 502–4                                       |
| adder, C-27                                                                                     | See also Arithmetic                                                | Add unsigned instruction, 226                        |
| CarryOut, C-28                                                                                  | Address-control lines, D-26                                        | Advanced Technology Attachment (ATA)                 |
| illustrated, C-29                                                                               | Addresses                                                          | disks, 577, 613, 614                                 |
| logical unit for AND/OR, C-27                                                                   | 32-bit immediates, 128–36                                          | AGP, A-9                                             |
| for most significant bit, C-33                                                                  | base, 83                                                           | Algol-60, CD2.20:6–7                                 |
| performing AND, OR, and addition,                                                               | byte, 84                                                           | Aliasing, 508                                        |
| C-31, C-33                                                                                      | defined, 82                                                        | Alignment restriction, 84                            |
| See also Arithmetic logic unit (ALU)                                                            | memory, 91                                                         | All-pairs N-body algorithm, A-65                     |
| 32-bit ALU, C-29–38                                                                             | virtual, 493–95, 514                                               | Alpha architecture                                   |
| from 31 copies of 1-bit ALU, C-34                                                               | Addressing                                                         | bit count instructions, E-29                         |
| with 32 1-bit ALUs, C-30                                                                        | 32-bit immediates, 128–36                                          | defined, 527                                         |
| defining in Verilog, C-35–38                                                                    | base, 133                                                          | floating-point instructions, E-28                    |
| illustrated, C-36                                                                               | displacement, 133                                                  | instructions, E-27–29                                |
| ripple carry adder, C-29                                                                        | immediate, 132, 133                                                | no divide, E-28                                      |
| tailoring to MIPS, C-31–35                                                                      | in jumps and branches, 129–32                                      | PAL code, E-28                                       |
| See also Arithmetic logic unit (ALU)                                                            | MIPS modes, 132–33                                                 | unaligned load-store, E-28                           |
| 32-bit immediate operands, 128–29                                                               | PC-relative, 130, 133                                              | VAX floating-point formats, E-29                     |
| 7090/7094 hardware, CD3.10:6                                                                    | pseudodirect, 133                                                  | ALU control, 316–18                                  |
|                                                                                                 | register, 132, 133                                                 | bits, 317                                            |
|                                                                                                 | x86 modes, 168, 170                                                | logic, D-6                                           |
| A                                                                                               | Addressing modes, B-45–47                                          | mapping to gates, D-4–7                              |
|                                                                                                 | desktop architectures, E-6                                         | truth tables, D-5                                    |
| Absolute references, 142                                                                        | embedded architectures, E-6                                        | See also Arithmetic logic unit (ALU)                 |
| Abstractions                                                                                    | Address select logic, D-24, D-25                                   | ALU control block, 320                               |
| defined, 20                                                                                     | Address space, 492, 496                                            | defined, D-4                                         |
| hardware/software interface,                                                                    | extending, 545                                                     | generating ALU control bits,                         |
| 20–21                                                                                           | flat, 545                                                          | D-6                                                  |
| principle, 21                                                                                   | ID (ASID), 510                                                     | ALUOp, 316, D-6                                      |
| Accumulator architectures, CD2.20:1                                                             | inadequate, CD5.13:5                                               | bits, 317, 318                                       |
| Accumulators, CD2.20:1                                                                          | shared, 639–40                                                     | control signal, 320                                  |
| Acronyms, 8                                                                                     | single physical, 638                                               | AMD64, 167, CD2.20:5                                 |

I-2 Index

| Amdahl's law, 477, 635                 | subtraction, 224–29                 | conditional code assembly, B-17        |
|----------------------------------------|-------------------------------------|----------------------------------------|
| corollary, 52                          | Arithmetic instructions             | defined, 11, B-4                       |
| defined, 51                            | desktop RISC, E-11                  | function, 141, B-10                    |
| fallacy, 684                           | embedded RISC, E-14                 | macros, B-4, B-15–17                   |
| AMD Opteron X4 (Barcelona),            | logical, 308                        | microcode, D-30                        |
| 20, 44–50, 300                         | MIPS, B-51–57                       | number acceptance, 141                 |
| address translation, 540               | operands, 80                        | object file, 141–42                    |
| architectural registers, 404           | See also Instructions               | pseudoinstructions, B-17               |
| base versus fully optimized            | Arithmetic intensity, 668           | relocation information, B-13, B-14     |
| performance, 683                       | Arithmetic logic unit (ALU)         | speed, B-13                            |
| caches, 541                            | 1-bit, C-26–29                      | symbol table, B-12                     |
| characteristics, 677                   | 32-bit, C-29–38                     | Assembly language                      |
| CPI, miss rates, and DRAM accesses,    | before forwarding, 368              | defined, 11, 139                       |
| 542                                    | branch datapath, 312                | drawbacks, B-9–10                      |
| defined, 677                           | hardware, 226                       | floating-point, 260                    |
| illustrated, 676                       | memory-reference instruction        | high-level languages versus, B-12      |
| LBMHD performance, 682                 | use, 301                            | illustrated, 12                        |
| memory hierarchies, 540–43             | for register values, 308            | MIPS, 78, 98–99, B-45–80               |
| microarchitecture, 404, 405            | R-format operations, 310            | production of, B-8–9                   |
| miss penalty reduction techniques,     | signed-immediate input, 371         | programs, 139                          |
| 541–43                                 | See also ALU control; Control units | translating into machine language,     |
| pipeline, 404–6                        | ARM instructions, 161–65            | 98–99                                  |
| pipeline illustration, 406             | 12-bit immediate field, 164         | when to use, B-7–9                     |
| roofline model, 678                    | addressing modes, 161–63            | Asserted signals, 305, C-4             |
| shared L3 cache, 543                   | block loads and stores, 165         | Associativity                          |
| SPEC CPU benchmark, 48–49              | brief history, CD2.20:4             | in caches, 482–83                      |
| SPEC power benchmark,                  | calculations, 161–63                | degree, increasing, 481, 518           |
| 49–50                                  | compare and conditional branch,     | floating-point addition, testing,      |
| SpMV performance, 681                  | 163–64                              | 270–71                                 |
| TLB hardware, 540                      | condition field, 383                | increasing, 486–87                     |
| American Standard Code for Information | data transfer, 162                  | set, tag size versus, 486–87           |
| Interchange. See ASCII                 | features, 164–65                    | Asynchronous interconnect, 583         |
| AND gates, C-12, D-7                   | formats, 164                        | Atomic compare and swap, 139           |
| AND operation, 103–4, B-52, C-6        | logical, 165                        | Atomic exchange, 137                   |
| Annual failure rate (AFR), 573, 613    | MIPS similarities, 162              | Atomic fetch-and-increment, 139        |
| Antidependence, 397                    | register-register, 162              | Atomic memory operation, A-21          |
| Antifuse, C-78                         | unique, E-36–37                     | Attribute interpolation, A-43–44       |
| Apple computer, CD1.10:6–7             | ARPANET, CD6.14:7                   | Availability, 573                      |
| Application binary interface           | Arrays                              | Average memory access time (AMAT), 478 |
| (ABI), 21                              | logic elements, C-18–19             | calculating, 478–79                    |
| Application programming interfaces     | multiple dimension, 266             | defined, 478                           |
| (APIs)                                 | pointers versus, 157–61             | defined, 170                           |
| defined, A-4                           | procedures for setting to zero,     |                                        |
| graphics, A-14                         | 158                                 | B                                      |
| Architectural registers, 404           | ASCII                               | Backpatching, B-13                     |
| Arithmetic, 222–83                     | binary numbers versus, 123          | Backplane bus, 582                     |
| addition, 224–29                       | character representation, 122       | Backups, 615–16                        |
| division, 236–42                       | defined, 122                        | Bandwidth                              |
| floating point, 242–70                 | symbols, 126                        | bisection, 661                         |
| for multimedia, 227–28                 | Assembler directives, B-5           | external to DRAM, 474                  |
| multiplication, 230–36                 | Assemblers, 140–42, B-10–17         | I/O, 618                               |

| L2 cache, 675                      | sticky, 268                          | Branch not taken                    |
|------------------------------------|--------------------------------------|-------------------------------------|
| memory, 471, 472                   | valid, 458                           | assumption, 377                     |
| network, 661                       | Blocking assignment, C-24            | defined, 311                        |
| Barrier synchronization, A-18      | Block-interleaved parity, 602–3      | Branch-on-equal instruction, 326    |
| defined, A-20                      | Blocks                               | Branch prediction                   |
| for thread communication, A-34     | combinational, C-4                   | buffers, 380, 381                   |
| Base addressing, 83, 133           | defined, 454                         | as control hazard solution, 342     |
| Base registers, 83                 | finding, 519–20                      | defined, 341                        |
| Basic block, 108–9                 | flexible placement, 479–84           | dynamic, 341, 342, 380–83           |
| Benchmarks                         | least recently used (LRU), 485       | static, 393                         |
| defined, 48                        | loads/stores, 165                    | Branch predictors                   |
| I/O, 596–98                        | locating in cache, 484–85            | accuracy, 381                       |
| Linpack, 664, CD3.10:3             | miss rate and, 465                   | correlation, 383                    |
| multicores, 657–84                 | multiword, mapping addresses         | information from, 382               |
| multiprocessor, 664–66             | to, 463–64                           | tournament, 383                     |
| NAS parallel, 666                  | placement locations, 518–19          | Branch taken                        |
| parallel, 665                      | placement strategies, 481            | cost reduction, 377                 |
| PARSEC suite, 666                  | replacement selection, 485           | defined, 311                        |
| SPEC CPU, 48–49                    | replacement strategies, 520–21       | Branch target                       |
| SPEC power, 49–50                  | spatial locality exploitation, 464   | addresses, 310                      |
| SPECrate, 664                      | state, C-4                           | buffers, 383                        |
| SPLASH/SPLASH 2, 664–66            | valid data, 458                      | Bubbles, 374                        |
| Stream, 675                        | Boolean algebra, C-6                 | Bubble Sort, 156                    |
| Biased notation, 94, 247           | Bounds check shortcut, 110           | Bus-based coherent multiprocessors, |
| Big-endian byte order, 84, B-43    | Branch datapath                      | CD7.14:6                            |
| Binary digits. See Bits            | ALU, 312                             | Buses, 584, 585                     |
| Binary numbers                     | operations, 311                      | backplane, 582                      |
| ASCII versus, 123                  | Branch delay slots                   | defined, C-19                       |
| conversion to decimal numbers, 90  | defined, 381                         | processor-memory, 582               |
| conversion to hexadecimal numbers, | scheduling, 382                      | synchronous, 583                    |
| 96                                 | Branch equal, 377                    | Bytes                               |
| defined, 87                        | Branches                             | addressing, 84                      |
| Bisection bandwidth, 661           | addressing in, 129–32                | order, 84, B-43                     |
| Bit error rate (BER), CD6.11:9     | compiler creation, 107               |                                     |
| Bit-interleaved parity, 602        | condition, 313                       | C                                   |
| Bit maps, 17                       | decision, moving up, 377             |                                     |
| defined, 16, 87                    | delayed, 111, 313, 343, 377–79, 381, | Cache-aware instructions, 547       |
| goal, 17                           | 382                                  | Cache coherence, 534–38             |
| storing, 17                        | ending, 108                          | coherence, 534                      |
| Bits                               | execution in ID stage, 378           | consistency, 535                    |
| ALUOp, 317, 318                    | pipelined, 378                       | enforcement schemes, 536            |
| defined, 11                        | target address, 378                  | implementation techniques,          |
| dirty, 501                         | unconditional, 106                   | CD5.9:10–11                         |
| done, 588                          | See also Conditional branches        | migration, 536                      |
| error, 588                         | Branch hazards. See Control hazards  | problem, 534, 535, 538              |
| guard, 266–67                      | Branch history tables. See Branch    | protocol example, CD5.9:11–15       |
| patterns, 269                      | prediction, buffers                  | protocols, 536                      |
| reference, 499                     | Branch instructions, B-59–63         | replication, 536                    |
| rounding, 268                      | jump instruction versus, 328         | snooping protocol, 536–38           |
| sign, 90                           | list of, B-60–63                     | snoopy, CD5.9:16                    |
| state, D-8                         | pipeline impact, 376                 | state diagram, CD5.9:15             |

| Cache coherency protocol, CD5.9:11–15 | inconsistent, 466                                            | fields, B-34, B-35                            |
|---------------------------------------|--------------------------------------------------------------|-----------------------------------------------|
| finite-state transition diagram,      | index, 460                                                   | illustrated, 591                              |
| CD5.9:12, CD5.9:14                    | Intrinsity FastMATH example,                                 | CDC 6600, CD1.10:6, CD4.15:2                  |
| functioning, CD5.9:12                 | 468–70                                                       | Central processor unit (CPU)                  |
| mechanism, CD5.9:13                   | locating blocks in, 484–85                                   | classic performance equation, 35–37           |
| state diagram, CD5.9:15               | locations, 458                                               | coprocessor 0, B-33–34                        |
| states, CD5.9:11–12                   | memory system design, 471–74                                 | defined, 19                                   |
| write-back cache, CD5.9:12            | multilevel, 475, 487–91                                      | execution time, 30, 31, 32                    |
| Cache controllers, 538                | nonblocking, 541                                             | performance, 30–32                            |
| cache coherency protocol,             | physically addressed, 508                                    | system, time, 30                              |
| CD5.9:11–15                           | physically indexed, 507                                      | time, 475                                     |
| coherent cache implementation         | physically tagged, 507                                       | time measurements, 31                         |
| techniques, CD5.9:10–11               | primary, 488, 489, 492                                       | user, time, 30                                |
| implementing, CD5.9:1–16              | secondary, 488, 489, 492                                     | See also Processors                           |
| snoopy cache coherence, CD5.9:16      | set-associative, 479                                         | Cg pixel shader program, A-15–17              |
| SystemVerilog, CD5.9:1–9              | simulating, 543–44                                           | Channel controllers, 593                      |
| Cache hits, 508                       | size, 462                                                    | Characters                                    |
| Cache misses                          | split, 470                                                   | ASCII representation, 122                     |
| block replacement on, 520–21          | summary, 474–75                                              | in Java, 126–27                               |
| capacity, 523                         | tag field, 460                                               | Chips. See Integrated circuits (ICs)          |
| compulsory, 523                       | tags, CD5.9:10, CD5.9:11                                     | C++ language, CD2.15:26, CD2.20:7             |
| conflict, 523                         | virtually addressed, 508                                     | C language                                    |
| defined, 465                          | virtually indexed, 508                                       | assignment, compiling into MIPS,              |
| direct-mapped cache, 482              | virtually tagged, 508                                        | 79–80                                         |
| fully associative cache, 483          | virtual memory and TLB integration,                          | compiling assignment with registers,          |
| handling, 465–66                      | 504–8                                                        | 81–82                                         |
| memory-stall clock cycles, 475        | write-back, 467, 468, 521, 522                               |                                               |
| reducing with flexible block          | writes, 466–68                                               | compiling while loops in, 107–8               |
| placement, 479–84                     | write-through, 467, 468, 521, 522                            | sort algorithms, 157                          |
| set-associative cache, 482–83         | See also Blocks                                              | translation hierarchy, 140                    |
| steps, 466                            | Callee, 113, 116                                             | translation to MIPS assembly                  |
| in write-through cache, 467           | Callee-saved register, B-23                                  | language, 79                                  |
| Cache performance, 475–92             | Caller, 113                                                  | variables, 118                                |
| calculating, 477                      | Caller-saved register, B-23                                  | Classes                                       |
| hit time and, 478                     | Capabilities, CD5.13:7                                       | defined, CD2.15:14                            |
| impact on processor                   | Capacity misses, 523                                         | packages, CD2.15:20                           |
| performance, 476–77                   | Carry lookahead, C-38–47                                     | Clock cycles                                  |
| Caches, 457–75                        | 4-bit ALUs using, C-45                                       | defined, 31                                   |
| accessing, 459–65                     | adder, C-39                                                  | memory-stall, 475, 476                        |
| associativity in, 482–83              | fast, with first level of abstraction,                       | number of registers and, 81                   |
| bits in, 463                          | C-39–40                                                      | worst-case delay and, 330                     |
| bits needed for, 460                  | fast, with "infinite" hardware,                              | Clock cycles per instruction (CPI), 33–34,    |
| contents illustration, 461            | C-38–39                                                      | 341                                           |
| defined, 20, 457                      | fast, with second level of abstraction,                      | one level of caching, 488                     |
| direct-mapped, 457, 459, 463, 479     | C-40–46                                                      | two levels of caching, 488                    |
| disk controller, 578                  | plumbing analogy, C-42, C-43                                 | Clocking methodology, 305–7, C-48             |
| empty, 460                            | ripple carry speed versus, C-46                              | defined, 305                                  |
| flushing, 595                         | summary, C-46–47                                             | edge-triggered, 305, 306, C-48,               |
| FSM for controlling, 529–39           | Carry save adders, 235                                       | C-73                                          |
| fully associative, 479                | Cause register, 590                                          | level-sensitive, C-74, C-75–76                |
| GPU, A-38                             | defined, 386                                                 | for predictability, 305                       |

| Clock rate                            | Comparison instructions, B-57–59            | principles, 100                      |
|---------------------------------------|---------------------------------------------|--------------------------------------|
| defined, 31                           | floating-point, B-74–75                     | rack mount, 606                      |
| frequency switched as                 | list of, B-57–59                            | servers, 5                           |
| function of, 40                       | Comparisons, 108–9                          | Compute Unified Device Architecture. |
| power and, 39                         | constant operands in, 109                   | See CUDA programming                 |
| Clocks, C-48–50                       | signed versus unsigned, 110                 | environment                          |
| edge, C-48, C-50                      | Compilers, 139                              | Conditional branches                 |
| in edge-triggered design, C-73        | branch creation, 107                        | ARM, 163                             |
| skew, C-74                            | brief history, CD2.20:8                     | changing program counter             |
| specification, C-57                   | conservative, CD2.15:5–6                    | with, 383                            |
| synchronous system, C-48–49           | defined, 11                                 | compiling if-then-else into, 106     |
| Clusters, CD7.14:7–8                  | front end, CD2.15:2                         | defined, 105                         |
| defined, 632, 641, CD7.14:7           | function, 13, 139, B-5–6                    | desktop RISC, E-16                   |
| drawbacks, 642                        | high-level optimizations,                   | embedded RISC, E-16                  |
| isolation, 644                        | CD2.15:3–4                                  | implementation, 112                  |
| organization, 631                     | ILP exploitation, CD4.15:4–5                | in loops, 130                        |
| overhead in division of memory, 642   | Just In Time (JIT), 148                     | PA-RISC, E-34, E-35                  |
| scientific computing on, CD7.14:7     | machine language production, B-8–9,         | PC-relative addressing, 130          |
| Cm*, CD7.14:3–4                       | B-10                                        | RISC, E-10–16                        |
| C.mmp, CD7.14:3                       | optimization, 160, CD2.20:8                 | SPARC, E-10–12                       |
| Coarse-grained multithreading, 645–46 | speculation, 392–93                         | Conditional move instructions,       |
| Cobol, CD2.20:6                       | structure, CD2.15:1                         | 383                                  |
| Code generation, CD2.15:12            | Compiling                                   | Condition field, 383                 |
| Code motion, CD2.15:6                 | C assignment statements, 79–80              | Conflict misses, 523                 |
| Combinational blocks, C-4             | C language, 107–8, 161, CD2.15:1–2          | Constant-manipulating                |
| Combinational control units, D-4–8    | floating-point programs, 262–65             | instructions, B-57                   |
| Combinational elements, 304           | if-then-else, 106                           | Constant memory, A-40                |
| Combinational logic, 306, C-3, C-9–20 | in Java, CD2.15:18–19                       | Constant operands, 86–87             |
| arrays, C-18–19                       | procedures, 114, 117–18                     | in comparisons, 109                  |
| decoders, C-9                         | recursive procedures, 117–18                | frequent occurrence, 87              |
| defined, C-5                          | while loops, 107–8                          | Content Addressable Memory           |
| don't cares, C-17–18                  | Compressed sparse row (CSR) matrix,         | (CAM), 485                           |
| multiplexors, C-10                    | A-55, A-56                                  | Context switch, 510                  |
| ROMs, C-14–16                         | Compulsory misses, 523                      | Control                              |
| two-level, C-11–14                    | Computers                                   | ALU, 316–18                          |
| Verilog, C-23–26                      | application classes, 5–7                    | challenge, 384                       |
| Commands, to I/O devices, 588–89      | applications, 4                             | finishing, 327                       |
| Commercial computer development,      | arithmetic for, 222–83                      | forwarding, 366                      |
| CD1.10:3–9                            | characteristics, CD1.10:12                  | FSM, D-8–21                          |
| Commit units                          | commercial development,                     | implementation, optimizing,          |
| buffer, 399                           | CD1.10:3–9                                  | D-27–28                              |
| defined, 399                          | component organization, 14                  | for jump instruction, 329            |
| in update control, 402                | components, 14, 223, 569                    | mapping to hardware, D-2–32          |
| Common case fast, 177                 | design measure, 55                          | memory, D-26                         |
| Common subexpression elimination,     | desktop, 5, 15                              | organizing, to reduce logic,         |
| CD2.15:5                              | embedded, 5–7, B-7                          | D-31–32                              |
| Communication, 24–25                  | first, CD1.10:1–3                           | pipelined, 359–63                    |
| overhead, reducing, 43                | in information revolution, 4                | Control flow graphs, CD2.15:8–9      |
| thread, A-34                          | instruction representation, 94–101          | defined, CD2.15:8                    |
| Compact code, CD2.20:3                | laptop, 18                                  | illustrated examples, CD2.15:8,      |
| Compact disks (CDs), 23, 24           | performance measurement, CD1.10:9           | CD2.15:9                             |

| Control functions                      | MIPS, D-10                           | D                                      |
|----------------------------------------|--------------------------------------|----------------------------------------|
| ALU, mapping to gates, D-4–7           | next-state outputs, D-10, D-12–13    |                                        |
| defining, 321                          | output, 316–17, D-10                 | Databases                              |
| PLA, implementation, D-7,              | See also Arithmetic logic unit (ALU) | brief history, CD6.14:4                |
| D-20–21                                | Conversion instructions, B-75–76     | Integrated Data Store (IDS), CD6.14:4  |
| ROM, encoding, D-18–19                 | Cooperative thread arrays (CTAs),    | relational, CD6.14:5                   |
| for single-cycle implementation,       | A-30                                 | Datacenters, 5                         |
| 327                                    | Coprocessors                         | Data flow analysis, CD2.15:8           |
| Control hazards, 339–43, 375–84        | coprocessor 0, B-33–34               | Data hazards, 336–39, 363–75           |
| branch delay reduction, 377–79         | defined, 266                         | defined, 336                           |
| branch not taken assumption, 377       | move instructions, B-71–72           | forwarding, 336, 363–75                |
| branch prediction as solution, 342     | Copy back. See Write-back            | load-use, 338, 377                     |
| defined, 339, 376                      | Core MIPS instruction set, 282       | stalls and, 371–74                     |
| delayed decision approach, 343         | abstract view, 302                   | See also Hazards                       |
| dynamic branch prediction, 380–83      | desktop RISC, E-9–11                 | Data layout directives, B-14           |
| logic implementation in Verilog,       | implementation, 300–303              | Data-level parallelism, 649            |
| CD4.12:7–9                             | implementation illustration, 304     | Data movement instructions, B-70–73    |
| pipeline stalls as solution, 340       | overview, 301–3                      | Data parallel problem decomposition,   |
| pipeline summary, 383–84               | subset, 300–301                      | A-17, A-18                             |
| simplicity, 376                        | See also MIPS                        | Datapath elements                      |
| solutions, 340                         | Cores                                | defined, 307                           |
| static multiple-issue processors and,  | defined, 41                          | sharing, 313                           |
| 394                                    | number per chip, 42                  | Datapaths                              |
| Control lines                          | Correcting code, 602                 | branch, 311, 312                       |
| asserted, 323                          | Correlation predictor, 383           | building, 307–16                       |
| in datapath, 320                       | Cosmic Cube, CD7.14:6                | control signal truth tables, D-14      |
| execution/address calculation, 361     | Count register, B-34                 | control unit, 322                      |
| final three stages, 361                | Cray computers, CD3.10:4, CD3.10:5   | defined, 19                            |
| instruction decode/register file read, | Critical word first, 465             | design, 307                            |
| 361                                    | Crossbar networks, 662               | exception handling, 387                |
| instruction fetch, 361                 | CTSS (Compatible Time-Sharing        | for fetching instructions, 309         |
| memory access, 362                     | System), CD5.13:8                    | for hazard resolution via forwarding,  |
| setting of, 321, 323                   | CUDA programming environment, 659,   | 370                                    |
| values, 360                            | A-5, CDA.11:5                        | for jump instruction, 329              |
| write-back, 362                        | barrier synchronization, A-18, A-34  | for memory instructions, 314           |
| Control signals                        | defined, A-5                         | for MIPS architecture, 315             |
| ALUOp, 320                             | development, A-17, A-18              | in operation for branch-on-equal       |
| defined, 306                           | hierarchy of thread groups, A-18     | instruction, 326                       |
| effect of, 321                         | kernels, A-19, A-24                  | in operation for load instruction, 325 |
| multi-bit, 322                         | key abstractions, A-18               | in operation for R-type                |
| pipelined datapaths with, 359          | paradigm, A-19–23                    | instruction, 324                       |
| truth tables, D-14                     | parallel plus-scan template, A-61    | operation of, 321–26                   |
| Control units, 303                     | per-block shared memory, A-58        | pipelined, 344–58                      |
| address select logic, D-24, D-25       | plus-reduction implementation,       | for R-type instructions, 314, 323      |
| combinational, implementing,           | A-63                                 | single, creating, 313–16               |
| D-4–8                                  | programs, A-6, A-24                  | single-cycle, 345                      |
| with explicit counter, D-23            | scalable parallel programming with,  | static two-issue, 395                  |
| illustrated, 322                       | A-17–23                              | Data race, 137                         |
| logic equations, D-11                  | SDK, 172                             | Data rate, 596                         |
| main, designing, 318–26                | shared memories, A-18                | Data segment, B-13                     |
| as microcode, D-28                     | threads, A-36                        | Data selectors, 303                    |
|                                        |                                      |                                        |

| Data structure compression, 680      | arithmetic/logical instructions,     | Disk storage, 575–79                 |
|--------------------------------------|--------------------------------------|--------------------------------------|
| Data transfer instructions           | E-11                                 | characteristics, 579                 |
| defined, 82                          | conditional branches, E-16           | densities, 577                       |
| load, 83                             | constant extension summary, E-9      | history, CD6.14:1–4                  |
| offset, 83                           | control instructions, E-11           | interfaces, 577–78                   |
| store, 85                            | conventions equivalent to MIPS core, | as nonvolatile, 575                  |
| See also Instructions                | E-12                                 | rotational latency, 576              |
| Deasserted signals, 305, C-4         | data transfer instructions, E-10     | sectors, 575                         |
| Debugging information, B-13          | features added to, E-45              | seek time, 575                       |
| DEC disk drive, CD6.14:3             | floating-point instructions, E-12    | tracks, 575                          |
| Decimal numbers                      | instruction formats, E-7             | transfer time, 576                   |
| binary number conversion to, 90      | multimedia extensions, E-16–18       | Displacement addressing, 133         |
| defined, 87                          | multimedia support, E-18             | Divide algorithm, 239                |
| Decision-making instructions, 105–12 | types of, E-3                        | Dividend, 237                        |
| Decoders, C-9                        | See also Reduced instruction set     | Division, 236–42                     |
| defined, C-9                         | computer (RISC) architectures        | algorithm, 238                       |
| two-level, C-65                      | Desktop computers                    | dividend, 237                        |
| Decoding machine language, 134       | defined, 5                           | divisor, 237                         |
| DEC PDP-8, CD1.10:5                  | illustrated, 15                      | faster, 241                          |
| Deep Web, CD6.14:8                   | D flip-flops, C-51, C-53             | floating-point, 259, B-76            |
| Delayed branches, 111                | Dicing, 46                           | hardware, 237–39                     |
| as control hazard solution, 343      | Dies, 46                             | hardware, improved version,          |
| defined, 313                         | Digital design pipeline, 406–7       | 240                                  |
| embedded RISCs and, E-23             | Digital signal-processing (DSP)      | instructions, B-52–53                |
| for five-stage pipelines, 382        | extensions, E-19                     | in MIPS, 241–42                      |
| reducing, 377–79                     | Digital video disks (DVDs), 23, 24   | operands, 237                        |
| scheduling limitations, 381          | DIMMs (dual inline memory modules),  | quotient, 237                        |
| See also Branches                    | CD5.13:4                             | remainder, 237                       |
| Delayed decision, 343                | Direct3D, A-13                       | signed, 239–41                       |
| DeMorgan's theorems, C-11            | Direct-mapped caches                 | SRT, 241                             |
| Denormalized numbers, 270            | address portions, 484                | See also Arithmetic                  |
| Dependences                          | choice of, 520                       | Divisor, 237                         |
| bubble insertion and, 374            | defined, 457, 479                    | D latches, C-51, C-52                |
| detection, 365                       | illustrated, 459                     | Done bit, 588                        |
| name, 397                            | memory block location, 480           | Don't cares, C-17–18                 |
| between pipeline registers, 367      | misses, 482                          | example, C-17–18                     |
| between pipeline registers and ALU   | single comparator, 485               | term, 318                            |
| inputs, 366                          | total number of bits, 463            | Double Data Rate RAMs (DDRRAMs),     |
| sequence, 363                        | See also Caches                      | 473, C-65                            |
| Design                               | Direct memory access (DMA)           | Double precision                     |
| compromises and, 177                 | defined, 592                         | defined, 245                         |
| datapath, 307                        | multiple devices, 593                | FMA, A-45–46                         |
| digital, 406–7                       | setup, 593                           | GPU, A-45–46, A-74                   |
| I/O system, 598–99                   | transfers, 593, 595                  | representation, 249                  |
| logic, 303–7, C-1–79                 | Dirty bit, 501                       | See also Single precision            |
| main control unit, 318–26            | Dirty pages, 501                     | Double words, 168                    |
| memory hierarchy, challenges, 525    | Disk controllers                     | Dynamically linked libraries (DLLs), |
| pipelining instruction sets, 335     | caches, 578                          | 145–46                               |
| Desktop and server RISCs             | defined, 576                         | defined, 146                         |
| addressing modes, E-6                | time, 576                            | lazy procedure linkage version,      |
| architecture summary, E-4            | Disk read time, 577                  | 146, 147                             |

| Dynamic branch prediction, 380–83       | EDSAC (Electronic Delay Storage        | Ethernet, 24, 25, CD6.14:8                 |
|-----------------------------------------|----------------------------------------|--------------------------------------------|
| branch prediction buffer, 380           | Automatic Calculator), CD1.10:2,       | defined, CD6.11:5                          |
| defined, 380                            | CD5.13:1–2                             | multiple, CD6.11:6                         |
| loops and, 380                          | Eispack, CD3.10:3                      | success, CD6.11:5                          |
| See also Control hazards                | Electrically erasable programmable     | Exception enable, 512                      |
| Dynamic hardware predictors, 341        | read-only memory (EEPROM),             | Exception handlers, B-36–38                |
| Dynamic multiple-issue processors, 392, | 581                                    | defined, B-35                              |
| 397–400                                 | Elements                               | return from, B-38                          |
| pipeline scheduling, 398–400            | combinational, 304                     | Exception program counters                 |
| superscalar, 397                        | datapath, 307, 313                     | (EPCs), 385                                |
| See also Multiple issue                 | memory, C-50–58                        | address capture, 390                       |
| Dynamic pipeline scheduling,            | state, 305, 306, 308, C-48, C-50       | copying, 227                               |
| 399–400                                 | Embedded computers                     | defined, 227, 386                          |
| commit unit, 399                        | application requirements, 7            | in restart determination, 385              |
| concept, 400                            | defined, B-7                           | transferring, 229                          |
| defined, 398                            | design, 6                              | Exceptions, 384–91, B-35–36                |
| hardware-based speculation, 400         | growth, CD1.10:11–12                   | association, 390                           |
| primary units, 399                      | Embedded Microprocessor                | datapath with controls for                 |
| reorder buffer, 399                     | Benchmark Consortium                   | handling, 387                              |
| reservation station, 399                | (EEMBC), CD1.10:11–12                  | defined, 227, 385                          |
| Dynamic random access memory            | Embedded RISCs                         | detecting, 385                             |
| (DRAM), 453, 471, C-63–65               | addressing modes, E-6                  | event types and, 385                       |
| bandwidth external to, 474              | architecture summary, E-4              | imprecise, 390                             |
| cost, 23                                | arithmetic/logical instructions, E-14  | instructions, B-80                         |
| defined, 18–19, C-63                    | conditional branches, E-16             | interrupts versus, 384–85                  |
| DIMM, CD5.13:4                          | constant extension summary, E-9        | in MIPS architecture, 385–86               |
| Double Data Rate (DDR), 473             | control instructions, E-15             | overflow, 387                              |
| early board, CD5.13:4                   | data transfer instructions, E-13       | PC, 509, 511                               |
| GPU, A-37–38                            | delayed branch and, E-23               | pipelined computer example, 388            |
| growth of capacity, 27                  | DSP extensions, E-19                   | in pipelined implementation,               |
| history, CD5.13:3–4                     | general purpose registers, E-5         | 386–91                                     |
| pass transistor, C-63                   | instruction conventions, E-15          | precise, 390                               |
| SIMM, CD5.13:4, CD5.13:5                | instruction formats, E-8               | reasons for, 385–86                        |
| single-transistor, C-64                 | multiply-accumulate approaches, E-19   | result due to overflow in add              |
| size, 474                               | types of, E-4                          | instruction, 389                           |
| speed, 23                               | See also Reduced instruction set       | saving/restoring state on, 515             |
| synchronous (SDRAM), 473, C-60,         | computer (RISC) architectures          | Exclusive OR (XOR) instructions, B-57      |
| C-65                                    | Encoding                               | Executable files, B-4                      |
| two-level decoder, C-65                 | defined, D-31                          | defined, 142                               |
|                                         | floating-point instruction, 261        | linker production, B-19                    |
| E                                       | MIPS instruction, 98, 135, B-49        | Execute/address calculation                |
|                                         | ROM control function, D-18–19          | control line, 361                          |
| Early restart, 465                      | ROM logic function, C-15               | load instruction, 350                      |
| Edge-triggered clocking methodology,    | x86 instruction, 171–72                | store instruction, 352                     |
| 305, 306, C-48, C-73                    | ENIAC (Electronic Numerical Integrator | Execute or address calculation stage, 350, |
| advantage, C-49                         | and Calculator), CD1.10:1,             | 352                                        |
| clocks, C-73                            | CD1.10:2, CD1.10:3, CD5.13:1           | Execution time                             |
| defined, C-48                           | EPIC, CD4.15:4                         | CPU, 30, 31, 32                            |
| drawbacks, C-74                         | Error bit, 588                         | pipelining and, 344                        |
| illustrated, C-50                       | Error correction, C-65–67              | as valid performance measure, 54           |
| rising edge/falling edge, C-48          | Error detection, 602, C-66             | Explicit counters, D-23, D-26              |
|                                         |                                        |                                            |

|                                          |                                          |                                      |
|------------------------------------------|------------------------------------------|--------------------------------------|
| Exponents, 244–45                        | Fields                                   | defined, 244                         |
| EX stage                                 | Cause register, B-34, B-35               | diversity versus portability,        |
| load instructions, 350                   | defined, 95                              | CD3.10:2–3                           |
| overflow exception detection, 387        | format, D-31                             | division, 259                        |
| store instructions, 353                  | MIPS, 96–97                              | first dispute, CD3.10:1–2            |
| External labels, B-10                    | names, 97                                | form, 245                            |
|                                          | Status register, B-34, B-35              | fused multiply add, 268              |
| F                                        | Filebench, 597                           | guard digits, 266–67                 |
|                                          | Files, register, 308, 314, C-50, C-54–56 | history, CD3.10:1–10                 |
| Facilities, B-14–17                      | File server benchmark (SPECFS), 597      | IEEE 754 standard, 246, 247          |
| Failures                                 | Fine-grained multithreading, 645, 647    | intermediate calculations, 266       |
| disk, rates, 613–14                      | Finite-state machines (FSMs), 529–34,    | instruction encoding, 261            |
| mean time between (MTBF), 573            | C-67–72                                  | machine language, 260                |
| mean time to (MTTF), 573, 574,           | control, D-8–22                          | MIPS instruction frequency for, 282  |
| 613, 630                                 | controllers, 532                         | MIPS instructions, 259–61            |
| reasons for, 574                         | defined, 531, C-67                       | operands, 260                        |
| synchronizer, C-77                       | implementation, 531, C-70                | operands variation in x86, 274       |
| Fallacies                                | Mealy, 532                               | overflow, 245                        |
| add immediate unsigned, 276              | Moore, 532                               | packed format, 274                   |
| Amdahl's law, 684                        | for multicycle control, D-9              | precision, 271                       |
| assembly language for performance,       | next-state function, 531, C-67           | procedure with two-dimensional       |
| 174–75                                   | output function, C-67, C-69              | matrices, 263–65                     |
| commercial binary compatibility          | for simple cache controller, 533         | programs, compiling, 262–65          |
| importance, 175                          | state assignment, C-70                   | registers, 265                       |
| defined, 51                              | state register implementation, C-71      | representation, 244–50               |
| disk failure rates, 613–14               | style of, 532                            | rounding, 266–67                     |
| GPUs, A-72–74, A-75                      | synchronous, C-67                        | sign and magnitude, 245              |
| low utilization uses little power, 52    | SystemVerilog, CD5.9:6–9                 | SSE2 architecture, 274–75            |
| MTTF, 613                                | traffic light example, C-68–70           | subtraction, 259                     |
| peak performance, 684–85                 | Fixed-function graphics pipelines,       | underflow, 245                       |
| pipelining, 407                          | CDA.11:1                                 | units, 267                           |
| powerful instructions mean higher        | Flash-based removable memory             | in x86, 272–74                       |
| performance, 174                         | cards, 23                                | Floating-point addition, 250–54      |
| right shift, 275–76                      | Flash memory, 580–82                     | arithmetic unit block diagram, 254   |
| See also Pitfalls                        | brief history, CD6.14:4                  | associativity, testing, 270–71       |
| False sharing, 537                       | characteristics, 23, 580                 | binary, 251, 253                     |
| Fast carry                               | defined, 22, 580                         | illustrated, 252                     |
| with first level of abstraction,         | as EEPROM, 581                           | instructions, 259, B-73–74           |
| C-39–40                                  | NAND, CD6.14:4                           | steps, 250–51                        |
| with "infinite" hardware, C-38–39        | NOR, 581, CD6.14:4                       | Floating-point arithmetic (GPUs),    |
| with second level of abstraction,        | wear leveling, 581                       | A-41–46                              |
| C-40–46                                  | Flat address space, 545                  | basic, A-42                          |
| Fast Fourier Transforms (FFT), A-53      | Flip-flops                               | double precision, A-45–46, A-74      |
| Fiber Distributed Data Interface (FDDI), | defined, C-51                            | performance, A-44                    |
| CD6.14:8                                 | D flip-flops, C-51, C-53                 | specialized, A-42–44                 |
| Fibre Channel Arbitrated Loop            | Floating point, 242–70                   | supported formats, A-42              |
| (FC-AL), CD6.11:11                       | assembly language, 260                   | texture operations, A-44             |
| Field programmable devices (FPDs),       | backward step, CD3.10:3–4                | Floating-point instructions, B-73–80 |
| C-78                                     | binary to decimal conversion, 249        | absolute value, B-73                 |
| Field programmable gate arrays (FPGAs),  | branch, 259                              | addition, B-73–74                    |
| C-78                                     | challenges, 280                          | comparison, B-74–75                  |

| Floating-point instructions (continued) | Fully associative caches            | implications for, A-24                    |
|-----------------------------------------|-------------------------------------|-------------------------------------------|
| conversion, B-75–76                     | block replacement strategies, 521   | interfaces and drivers, A-9               |
| desktop RISC, E-12                      | choice of, 520                      | unified, A-10–12                          |
| division, B-76                          | defined, 479                        | Graph coloring, CD2.15:11                 |
| load, B-76–77                           | memory block location, 480          | Graphics displays                         |
| move, B-77–78                           | misses, 483                         | computer hardware support, 17             |
| multiplication, B-78                    | See also Caches                     | LCD, 16                                   |
| negation, B-78–79                       | Fully connected networks, 661, 662  | Graphics logical pipeline, A-10           |
| SPARC, E-31                             | Function code, 97                   | Graphics processing units (GPUs),         |
| square root, B-79                       | Fused-multiply-add (FMA) operation, | 654–60                                    |
| store, B-79                             | 268, A-45–46                        | as accelerators, 654                      |
| subtraction, B-79–80                    |                                     | attribute interpolation, A-43–44          |
| truncation, B-80                        | G                                   | computing, CDA.11:4                       |
| Floating-point multiplication,          |                                     | defined, 44, 634, A-3                     |
| 255–59                                  | Game consoles, A-9                  | driver software, 655                      |
| binary, 256–57                          | Gates, C-3, C-8                     | evolution, A-5, CDA.11:2                  |
| illustrated, 258                        | AND, C-12, D-7                      | fallacies and pitfalls, A-72–75           |
| instructions, 259                       | defined, C-8                        | floating-point arithmetic, A-17,          |
| significands, 255                       | delays, C-46                        | A-41–46, A-74                             |
| steps, 255–56                           | mapping ALU control function to,    | future trends, CDA.11:5                   |
| Floating vectors, CD3.10:2              | D-4–7                               | GeForce 8-series generation, A-5          |
| Flow-sensitive information,             | NAND, C-8                           | general computation, A-73–74              |
| CD2.15:14                               | NOR, C-8, C-50                      | General Purpose (GPGPUs), 656, A-5        |
| Flushing instructions, 377, 378         | Gateways, CD6.11:6                  | CDA.11:3                                  |
| defined, 377                            | General Purpose GPUs (GPGPUs), 656, | graphics mode, A-6                        |
| exceptions and, 390                     | A-5, CDA.11:3                       | graphics trends, A-4                      |
| For loops, 157                          | General-purpose registers           | history, A-3–4                            |
| inner, CD2.15:25                        | architectures, CD2.20:2–3           | logical graphics pipeline, A-13–14        |
| SIMD and, CD7.14:2                      | embedded RISCs, E-5                 | main memory, 655                          |
| Formal parameters, B-16                 | Generate                            | mapping applications to, A-55–72          |
| Format fields, D-31                     | defined, C-40                       | memory, 656                               |
| Fortran, CD2.20:6                       | example, C-44                       | multilevel caches and, 655                |
| Forwarding, 363–75                      | super, C-41                         | N-body applications, A-65–72              |
| ALU before, 368                         | Gigabytes, 23                       | NVIDIA architecture, 656–59               |
| control, 366                            | Global common subexpression         | parallelism, 655, A-76                    |
| datapath for hazard resolution, 370     | elimination, CD2.15:5               | parallel memory system, A-36–41           |
| defined, 336                            | Global memory, A-21, A-39           | performance doubling, A-4                 |
| functioning, 364–65                     | Global miss rates, 489              | perspective, 659–60                       |
| graphical representation, 337           | Global optimization, CD2.15:4–6     | programmable real-time, CDA.11:2–3        |
| illustrations, CD4.12:25–30             | code, CD2.15:6                      | programming, A-12–24                      |
| multiple results and, 339               | defined, CD2.15:4                   | programming interfaces to, 654, A-17      |
| multiplexors, 370                       | implementing, CD2.15:7–10           | real-time graphics, A-13                  |
| pipeline registers before, 368          | Global pointers, 118                | scalable, CDA.11:4–5                      |
| with two instructions, 336–37           | GPU computing                       | summary, A-76                             |
| Verilog implementation,                 | defined, A-5                        | See also GPU computing                    |
| CD4.12:3–5                              | visual applications, A-6–7          | Graphics shader programs, A-14–15         |
| Forward references, B-11                | See also Graphics processing        | Gresham's Law, 283, CD3.10:1              |
| Fractions, 244, 245, 246                | units (GPUs)                        | Grids, A-19                               |
| Frame buffer, 17                        | GPU system architectures, A-7–12    | Guard digits                              |
| Frame pointers, 119                     | graphics logical pipeline, A-10     | defined, 266                              |
| Front end, CD2.15:2                     | heterogeneous, A-7–9                | rounding with, 267                        |

| H                                          | Hexadecimal numbers, 95–96             | rounding modes, 268                         |
|--------------------------------------------|----------------------------------------|---------------------------------------------|
|                                            | binary number conversion to, 96        | today, CD3.10:9                             |
| Half precision, A-42                       | defined, 95                            | See also Floating point                     |
| Halfwords, 126                             | High-level languages, 11–13, B-6       | IEEE 802.11, CD6.11:8–10                    |
| Handlers                                   | benefits, 13                           | with base stations, CD6.11:9                |
| defined, 513                               | computer architectures, CD2.20:4       | cellular telephony versus, CD6.11:10        |
| TLB miss, 514                              | defined, 12                            | defined, CD6.11:8                           |
| Handshaking protocol, 584                  | importance, 12                         | Wired Equivalent Privacy, CD6.11:10         |
| Hard disks                                 | High-level optimizations, CD2.15:3–4   | IEEE 802.3, CD6.14:8                        |
| access times, 23                           | Hit rate, 454                          | I-format, 97                                |
| defined, 22                                | Hit time                               | If statements, 130                          |
| diameters, 23                              | cache performance and, 478             | If-then-else, 106                           |
| illustrated, 22                            | defined, 455                           | Immediate instructions, 86                  |
| read-write head, 22                        | Hit under miss, 541                    | Imprecise interrupts, 390, CD4.15:3         |
| Hardware                                   | Hold time, C-54                        | Index-out-of-bounds check, 110              |
| as hierarchical layer, 10                  | Horizontal microcode, D-32             | Induction variable elimination, CD2.15:6    |
| language of, 11–13                         | Hot-swapping, 605                      | Inheritance, CD2.15:14                      |
| operations, 77–80                          | Hubs, CD6.11:6, CD6.11:7               | In-order commit, 400                        |
| supporting procedures in, 112–22           | Hybrid hard disks, 581                 | Input devices, 15                           |
| synthesis, C-21                            |                                        | Inputs, 318                                 |
| translating microprograms to, D-28–32      | I                                      | Instances, CD2.15:14                        |
| virtualizable, 527                         |                                        | Instruction count, 35, 36                   |
| Hardware-based speculation, 400            | IBM 360/85, CD5.13:6                   | Instruction decode/register file read stage |
| Hardware description languages             | IBM 370, CD6.14:2                      | control line, 361                           |
| defined, C-20                              | IBM 701, CD1.10:4                      | load instruction, 348                       |
| using, C-20–26                             | IBM 7030, CD4.15:1                     | store instruction, 352                      |
| VHDL, C-20–21                              | IBM ALOG, CD3.10:6                     | Instruction execution illustrations,        |
| See also Verilog                           | IBM Blue Gene, CD7.14:8–9              | CD4.12:16–30                                |
| Hardware multithreading, 645–48            | IBM Cell QS20                          | clock cycles 1 and 2, CD4.12:20             |
| coarse-grained, 645–46                     | base versus fully optimized            | clock cycles 3 and 4, CD4.12:21             |
| defined, 645                               | performance, 683                       | clock cycles 5 and 6, CD4.12:22             |
| fine-grained, 645, 647                     | characteristics, 677                   | clock cycles 7 and 8, CD4.12:23             |
| options, 646                               | defined, 679                           | clock cycle 9, CD4.12:24                    |
| simultaneous, 646–48                       | illustrated, 676                       | examples, CD4.12:19–24                      |
| Harvard architecture, CD1.10:3             | LBMHD performance, 682                 | forwarding, CD4.12:25,                      |
| Hazard detection units, 372                | roofline model, 678                    | CD4.12:26–27                                |
| functions, 373                             | SpMV performance, 681                  | no hazard, CD4.12:16–19                     |
| pipeline connections for, 373              | IBM Personal Computer, CD1.10:7,       | pipelines with stalls and forwarding,       |
|                                            | CD2.20:5                               | CD4.12:25, CD4.12:28–30                     |
| Hazards, 335–43                            | IBM System/360 computers, CD1.10:5,    | Instruction fetch stage                     |
| control, 339–43, 375–84                    | CD3.10:4, CD3.10:5, CD5.13:5           | control line, 361                           |
| data, 336–39, 363–75                       | IBM z/VM, CD5.13:7                     | load instruction, 348                       |
| defined, 335                               | ID stage                               | store instruction, 352                      |
| forwarding and, 371                        | branch execution in, 378               | Instruction formats                         |
| structural, 335–36, 352                    | load instructions, 349                 | ARM, 164                                    |
| See also Pipelining                        | store instruction in, 349              | defined, 95                                 |
| Heap                                       |                                        | desktop/server RISC architectures,          |
| allocating space on, 120–22                | IEEE 754 floating-point standard, 246, | E-7                                         |
| defined, 120                               | 247, CD3.10:7–9                        |                                             |
| Heterogeneous systems, A-4–5               | first chips, CD3.10:7–9                | embedded RISC architectures, E-8            |
| architecture, A-7–9                        | in GPU arithmetic, A-42–43             | I-type, 97                                  |
| defined, A-3                               | implementation, CD3.10:9               | J-type, 129                                 |

I-12 Index

| Instruction formats (continued)                      | logical operations, 102–5               | Integrated circuits (ICs)                                  |
|------------------------------------------------------|-----------------------------------------|------------------------------------------------------------|
| jump instruction, 328                                | M32R, E-40                              | cost, 46                                                   |
| MIPS, 164                                            | memory access, A-33–34                  | defined, 26                                                |
| R-type, 97, 319                                      | memory-reference, 301                   | manufacturing process, 45                                  |
| x86, 173                                             | MIPS-16, E-40–42                        | very large-scale (VLSIs), 26                               |
| Instruction latency, 408                             | MIPS-64, E-25–27                        | See also specific chips                                    |
| Instruction-level parallelism (ILP)                  | multiplication, 235, B-53–54            | Integrated Data Store (IDS), CD6.14:4                      |
| compiler exploitation, CD4.15:4–5                    | negation, B-54                          | Intel IA-64 architecture, CD4.15:4                         |
| defined, 41, 391                                     | nop, 373                                | Intel Nehalem                                              |
| exploitation, increasing, 402                        | PA-RISC, E-34–36                        | address translation for, 540                               |
| See also Parallelism                                 | performance, 33–34                      | caches, 541                                                |
| Instruction mix, 37, CD1.10:9                        | pipeline sequence, 372                  | die processor photo, 539                                   |
| Instructions, 74–221                                 | PowerPC, E-12–13, E-32–34               | memory hierarchies, 540–43                                 |
| add immediate, 86                                    | PTX, A-31, A-32                         | miss penalty reduction techniques,                         |
| addition, 226, B-51                                  | remainder, B-55                         | 541–43                                                     |
| Alpha, E-27–29                                       | representation in computer, 94–101      | TLB hardware for, 540                                      |
| arithmetic-logical, 308, B-51–57                     | restartable, 513                        | Intel Paragon, CD7.14:7                                    |
| ARM, 161–65, E-36–37                                 | resuming, 516                           | Intel Threading Building Blocks, A-60                      |
| assembly, 80                                         | R-type, 308–9                           | Intel Xeon e5345                                           |
| basic block, 108–9                                   | shift, B-55–56                          | base versus fully optimized                                |
| branch, B-59–63                                      | SPARC, E-29–32                          | performance, 683                                           |
| cache-aware, 547                                     | store, 85, B-68–70                      | characteristics, 677                                       |
| comparison, B-57–59                                  | store conditional, 138–39               | defined, 677                                               |
| conditional branch, 105                              | subtraction, 226, B-56–57               | illustrated, 677                                           |
| conditional move, 383                                | SuperH, E-39–40                         | LBMHD performance, 682                                     |
| constant-manipulating, B-57                          | thread, A-30–31                         | roofline model, 678                                        |
| conversion, B-75–76                                  | Thumb, E-38                             | SpMV performance, 681                                      |
| core, 282                                            | trap, B-64–66                           | Interference graphs, CD2.15:11                             |
| data movement, B-70–73                               | vector, 652                             | Interleaving, 472, 474                                     |
| data transfer, 82                                    | as words, 76                            | Intermediate addressing, 132, 133                          |
| decision-making, 105–12                              | x86, 165–74                             | Internetworking, CD6.11:1–3                                |
| defined, 11, 76                                      | See also Arithmetic instructions;       | Interprocedural analysis, CD2.15:13                        |
| desktop RISC conventions, E-12                       | MIPS; Operands                          | Interrupt-driven I/O, 589                                  |
| division, B-52–53                                    | Instruction set architecture            | Interrupt enable, 512                                      |
| as electronic signals, 94                            | ARM, 161–65                             | Interrupt handlers, B-33                                   |
| embedded RISC conventions, E-15                      | branch address calculation, 310         | Interrupt priority levels (IPLs),                          |
| encoding, 98                                         | defined, 21, 54                         | 590–92                                                     |
| exception and interrupt, B-80                        | history, 179                            | defined, 591                                               |
| exclusive OR, B-57                                   | maintaining, 54                         |                                                            |
| fetching, 309                                        | protection and, 528–29                  | higher, 592                                                |
| fields, 95                                           | thread, A-31–34                         | Interrupts                                                 |
| floating-point (x86), 273                            | virtual machine support, 527–28         | defined, 227, 385                                          |
| flushing, 377, 378, 390                              | Instruction sets                        | event types and, 385                                       |
|                                                      |                                         | exceptions versus, 384–85                                  |
|                                                      | ARM, 383                                | imprecise, 390, CD4.15:3                                   |
| immediate, 86                                        | design for pipelining, 335              | instructions, B-80                                         |
| introduction to, 76–77                               | MIPS, 77, 178, 279                      | precise, 390                                               |
| I/O, 589                                             | MIPS-32, 281                            | vectored, 386                                              |
| jump, 111, 113, B-63–64                              | NVIDIA GeForce 8800, A-49               | Intrinsity FastMATH processor,                             |
| left-to-right flow, 346                              | Pseudo MIPS, 281                        | 468–70                                                     |
| load, 83, B-66–68                                    | x86 growth, 176                         | caches, 469                                                |
| load linked, 138                                     | Instructions per clock cycle (IPC), 391 | data miss rates, 470, 484                                  |

| defined, 468                  | operating system responsibilities    | L                                   |
|-------------------------------|--------------------------------------|-------------------------------------|
| read processing, 506          | and, 587–88                          |                                     |
| TLB, 504                      | organization, 585                    | Labels                              |
| write-through processing, 506 | peak transfer rate, 617              | global, B-10, B-11                  |
| Inverted page tables, 500     | performance, 618                     | local, B-11                         |
| I/O, B-38–40, CD6.14:1–8      | power evaluation, 611–12             | LAPACK, 271                         |
| bandwidth, 618                | weakest link, 598                    | Laptop computers, 18                |
| chip sets, 586                | Issue packets, 393                   | Large-scale multiprocessors,        |
| coherence problem for, 595    |                                      | CD7.14:6–7, CD7.14:8–9              |
| controllers, 593, 615         | J                                    | Latches                             |
| future directions, 618        |                                      | defined, C-51                       |
| instructions, 589             | Java                                 | D latch, C-51, C-52                 |
| interrupt-driven, 589         | bytecode, 147                        | Latency                             |
| memory-mapped, 588, B-38      | bytecode architecture, CD2.15:16     | constraints, 598                    |
| parallelism and, 599–606      | characters in, 126–27                | instruction, 408                    |
| performance, 572              | compiling in, CD2.15:18-19           | memory, A-74–75                     |
| performance measures, 596–98  | goals, 146                           | pipeline, 344                       |
| processor communication,      | interpreting, 148, 161, CD2.15:14–15 | rotational, 576                     |
| 589–90                        | keywords, CD2.15:20                  | use, 395, 396                       |
| rate, 596, 610, 611           | method invocation in,                | Lattice Boltzmann Magneto-          |
| requests, 572, 618            | CD2.15:19–20                         | Hydrodynamics (LBMHD),              |
| standards, 584                | pointers, CD2.15:25                  | 680–82                              |
| system performance impact,    | primitive types, CD2.15:25           | defined, 680                        |
| 599–600                       | programs, starting, 146–48           | optimizations, 681–82               |
| systems, 570                  | reference types, CD2.15:25           | performance, 682                    |
| transactions, 583             | sort algorithms, 157                 | Leaf procedures                     |
| I/O benchmarks, 596–97        | strings in, 126–27                   | defined, 116                        |
| file system, 597–98           | translation hierarchy, 148           | example, 126                        |
| transaction processing,       | while loop compilation in,           | See also Procedures                 |
| 596–97                        | CD2.15:17–18                         | Least recently used (LRU)           |
| Web, 597–98                   | Java Virtual Machine (JVM), 147,     | as block replacement strategy, 521  |
| See also Benchmarks           | CD2.15:15                            | defined, 485                        |
| I/O devices                   | Job-level parallelism, 632           | pages, 499                          |
| characteristics, 571          | J-type instruction format, 129       | Least significant bits, C-32        |
| commands to, 588–89           | Jump instructions, 312               | defined, 88                         |
| diversity, 571                | branch instruction versus, 328       | SPARC, E-31                         |
| expandability, 572            | control and datapath for, 329        | Left-to-right instruction flow, 346 |
| illustrated, 570              | implementing, 328                    | Level-sensitive clocking, C-74,     |
| interfacing, 586–95           | instruction format, 328              | C-75–76                             |
| maximum number, 617           | list of, B-63–64                     | defined, C-74                       |
| multiple paths to, 618        | MIPS-64, E-26                        | two-phase, C-75                     |
| priorities, 590–92            | Just In Time (JIT) compilers,        | Lines. See Blocks                   |
| reads/writes to, 572          | 148, 687                             | Linkers, 142–45, B-18–19            |
| transfers, 585, 592–93        |                                      | defined, 142, B-4                   |
| I/O interconnects             | K                                    | executable files, 142, B-19         |
| function, 583                 |                                      | function illustration, B-19         |
| of x86 processors, 584–86     | Karnaugh maps, C-18                  | steps, 142                          |
| I/O systems                   | Kernel mode, 509                     | using, 143–45                       |
| design, 598–99                | Kernels                              | Linking object files, 143–45        |
| design example, 609–11        | CUDA, A-19, A-24                     | Linpack, 664, CD3.10:3              |
| history, 618                  | defined, A-19                        | Liquid crystal displays (LCDs), 16  |

| LISP, SPARC support, E-30               | Local miss rates, 489             | Machine language                      |
|-----------------------------------------|-----------------------------------|---------------------------------------|
| Little-endian byte order, B-43          | Local optimization, CD2.15:4–6    | branch offset in, 131–32              |
| Live range, CD2.15:10                   | defined, CD2.15:4                 | decoding, 134                         |
| Livermore Loops, CD1.10:10              | implementing, CD2.15:7            | defined, 11, 95, B-3                  |
| Load balancing, 637–38                  | See also Optimization             | floating-point, 260                   |
| Loaders, 145                            | Locks, 639                        | illustrated, 12                       |
| Loading, B-19–20                        | Lock synchronization, 137         | MIPS, 100                             |
| Load instructions                       | Logic                             | SRAM, 20                              |
| access, A-41                            | address select, D-24, D-25        | translating MIPS assembly language    |
| base register, 319                      | ALU control, D-6                  | into, 98–99                           |
| block, 165                              | combinational, 306, C-5, C-9–20   | Macros                                |
| compiling with, 85                      | components, 305                   | defined, B-4                          |
| datapath in operation for, 325          | control unit equations, D-11      | example, B-15–17                      |
| defined, 83                             | design, 303–7, C-1–79             | use of, B-15                          |
| details, B-66–68                        | equations, C-7                    | Magnetic disks. See Hard disks        |
| EX stage, 350                           | minimization, C-18                | Magnetic tapes, 615–16                |
| floating-point, B-76–77                 | programmable array (PAL), C-78    | defined, 23                           |
| halfword unsigned, 126                  | sequential, C-5, C-56–58          | use history, 615–16                   |
| ID stage, 349                           | two-level, C-11–14                | Main memory, 493                      |
| IF stage, 349                           | Logical operations, 102–5         | defined, 21                           |
| linked, 138, 139                        | AND, 103–4, B-52                  | page tables, 501                      |
| list of, B-66–68                        | ARM, 165                          | physical addresses, 492, 493          |
| load byte unsigned, 124                 | defined, 102–5                    | See also Memory                       |
| load half, 126                          | desktop RISC, E-11                | Mapping applications, A-55–72         |
| load upper immediate, 128, 129          | embedded RISC, E-14               | Mark computers, CD1.10:3              |
| MEM stage, 351                          | MIPS, B-51–57                     | Mealy machine, 532, C-68, C-71, C-72  |
| pipelined datapath in, 355              | NOR, 104–5, B-54                  | Mean time between failures            |
| signed, 124                             | NOT, 104, B-55                    | (MTBF), 573                           |
| unit for implementing, 311              | OR, 104, B-55                     | Mean time to failure (MTTF), 573, 574 |
| unsigned, 124                           | shifts, 102                       | fallacies, 613                        |
| WB stage, 351                           | Long-haul networks, CD6.11:5      | ratings, 600                          |
| See also Store instructions             | Long instruction word (LIW),      | Mean time to repair (MTTR), 573, 574  |
| Load-store architectures, CD2.20:2      | CD4.15:4                          | Memory                                |
| Load-use data hazard, 338, 377          | Lookup tables (LUTs), C-79        | addresses, 91                         |
| Load-use stalls, 377                    | Loops, 107–8                      | affinity, 680, 681                    |
| Load word, 83, 85                       | conditional branches in, 130      | atomic, A-21                          |
| Local area networks (LANs), CD6.11:5–8, | defined, 107                      | bandwidth, 471, 472                   |
| CD6.14:8                                | for, 157, CD2.15:25               | cache, 20, 457–92                     |
| defined, 25                             | prediction and, 380               | CAM, 485                              |
| Ethernet, CD6.11:5–6                    | test, 158, 159                    | constant, A-40                        |
| hubs, CD6.11:6, CD6.11:7                | while, compiling, 107–8           | control, D-26                         |
| routers, CD6.11:6                       | Loop unrolling                    | defined, 17                           |
| switches, CD6.11:6–7                    | defined, 397, CD2.15:3            | DRAM, 18–19, 453, 471, 473, C-63–65   |
| wireless, CD6.11:8–11                   | for multiple-issue pipelines, 397 | efficiency, 642                       |
| See also Networks                       | register renaming and, 397        | flash, 22, 23, 580–82, CD6.14:4       |
| Locality                                |                                   | global, A-21, A-39                    |
| principle, 452, 453                     | M                                 | GPU, 656                              |
| spatial, 452-53, 456                    |                                   | instructions, datapath for, 314       |
| temporal, 452, 453, 456                 | M32R, E-15, E-40                  | layout, B-21                          |
| Local labels, B-11                      | Machine code, 95                  | local, A-21, A-40                     |
| Local memory, A-21, A-40                | Machine instructions, 95          | main, 21                              |

| nonvolatile, 21                     | structure diagram, 456                     | compiling C assignment statements        |
|-------------------------------------|--------------------------------------------|------------------------------------------|
| operands, 82–83                     | variance, 491                              | into, 79                                 |
| parallel system, A-36–41            | virtual memory, 492–517                    | compiling complex C assignment           |
| read-only (ROM), C-14-16            | Memory-mapped I/O                          | into, 79–80                              |
| SDRAM, 473                          | defined, 588                               | constant-manipulating instructions,      |
| secondary, 22                       | use of, B-38                               | B-57                                     |
| shared, A-21, A-39-40               | Memory-stall clock cycles, 475, 476        | control registers, 511                   |
| spaces, A-39                        | Message passing                            | control unit, D-10                       |
| SRAM, C-58–62                       | defined, 641                               | CPU, B-46                                |
| stalls, 478                         | multiprocessors, 641-45                    | divide in, 241–42                        |
| technologies for building, 25–26    | Metastability, C-76                        | exceptions in, 385–86                    |
| texture, A-40                       | Methods                                    | fields, 96–97                            |
| usage, B-20–22                      | defined, CD2.15:14                         | floating-point instructions, 259-61      |
| virtual, 492–517                    | invoking in Java, CD2.15:19-20             | FPU, B-46                                |
| volatile, 21                        | static, B-20                               | instruction classes, 179                 |
| Memory access instructions, A-33-34 | Microarchitectures                         | instruction encoding, 98, 135, B-49      |
| Memory access stage                 | AMD Opteron X4 (Barcelona), 405            | instruction formats, 136, 164, B-49-51   |
| control line, 362                   | defined, 404                               | instruction set, 77, 178, 279            |
| load instruction, 350               | Microcode                                  | jump instructions, B-63-66               |
| store instruction, 352              | assembler, D-30                            | logical instructions, B-51–57            |
| Memory consistency model, 538       | control unit as, D-28                      | machine language, 100                    |
| Memory elements, C-50–58            | defined, D-27                              | memory addresses, 84                     |
| clocked, C-51                       | dispatch ROMs, D-30-31                     | memory allocation for program and        |
| D flip-flop, C-51, C-53             | field translation, D-29                    | data, 120                                |
| D latch, C-52                       | horizontal, D-32                           | multiply in, 235                         |
| DRAMs, C-63–67                      | vertical, D-32                             | opcode map, B-50                         |
| flip-flop, C-51                     | Microinstructions, D-31                    | operands, 78                             |
| hold time, C-54                     | Microprocessors                            | Pseudo, 280, 281                         |
| latch, C-51                         | design shift, 633                          | register conventions, 121                |
| setup time, C-53, C-54              | multicore, 8, 41, 632                      | static multiple issue with, 394–97       |
| SRAMs, C-58–62                      | Microprograms                              | MIPS-16, E-15–16                         |
| unclocked, C-51                     | as abstract control representation, D-30   | 16-bit instruction set, E-41–42          |
| Memory hierarchies                  | translating to hardware, D-28–32           | immediate fields, E-41                   |
| block (or line), 454                | Migration, 536                             | instructions, E-40–42                    |
| cache performance, 475–92           | Million instructions per second (MIPS), 53 | MIPS core instruction changes, E-42      |
| caches, 457–75                      | Minterms                                   | PC-relative addressing, E-41             |
| common framework, 518–25            | defined, C-12, D-20                        | MIPS-32 instruction set, 281             |
| defined, 453                        | in PLA implementation, D-20                | MIPS-64 instructions, E-25–27            |
| design challenges, 525              | MIP-map, A-44                              | conditional procedure call               |
| development, CD5.13:5–7             | MIPS, 78, 98–99, B-45–80                   | instructions, E-27                       |
| exploiting, 450–548                 | addressing for 32-bit immediates,          | constant shift amount, E-25              |
| inclusion, 542                      | 128–36                                     | jump/call not PC-relative, E-26          |
| level pairs, 455                    | addressing modes, B-45-47                  | move to/from control registers, E-26     |
| multiple levels, 454                | arithmetic core, 280                       | nonaligned data transfers, E-25          |
| overall operation of, 507           | arithmetic instructions, 77, B-51–57       | NOR, E-25                                |
| parallelism and, 534–38             | ARM similarities, 162                      | parallel single precision floating-point |
| pitfalls, 543–47                    | assembler directive support, B-47-49       | operations, E-27                         |
| program execution time and, 491     | assembler syntax, B-47–49                  | reciprocal and reciprocal square root,   |
| quantitative design parameters, 518 | assembly instruction, mapping, 95          | E-27                                     |
| reliance on, 455                    | branch instructions, B-59–63               | SYSCALL, E-25                            |
| structure, 454                      | comparison instructions, B-57-59           | TLB instructions, E-26–27                |

| MIPS core                             | Multilevel caches                        | product, 230                        |
|---------------------------------------|------------------------------------------|-------------------------------------|
| architecture, 243                     | complications, 489                       | sequential version, 231-33          |
| arithmetic/logical instructions       | defined, 475, 489                        | signed, 234                         |
| not in, E-21, E-23                    | miss penalty, reducing, 487-91           | See also Arithmetic                 |
| common extensions to, E-20-25         | performance of, 487–88                   | Multiplier, 230                     |
| control instructions not in, E-21     | summary, 491–92                          | Multiply-add (MAD), A-42            |
| data transfer instructions not in,    | See also Caches                          | Multiply algorithm, 234             |
| E-20, E-22                            | Multimedia arithmetic, 227–28            | Multiprocessors                     |
| floating-point instructions           | Multimedia extensions                    | benchmarks, 664-66                  |
| not in, E-22                          | desktop/server RISCs, E-16-18            | bus-based coherent, CD7.14:6        |
| instruction set, 282, 300-303,        | vector versus, 653                       | defined, 632                        |
| E-9-10                                | Multiple-clock-cycle pipeline            | historical perspective, 688         |
| Mirroring, 602                        | diagrams, 356                            | large-scale, CD7.14:6-7, CD7.14:8-9 |
| Miss penalty                          | defined, 356                             | message-passing, 641-45             |
| defined, 455                          | five instructions, 357                   | multithreaded architecture,         |
| determination, 464                    | illustrated, 357                         | A-26–27, A-35–36                    |
| multilevel caches, reducing, 487-91   | Multiple dimension arrays, 266           | organization, 631, 641              |
| reduction techniques, 541-43          | Multiple instruction multiple data       | for performance, 686–87             |
| Miss rates                            | (MIMD), 659                              | shared-memory, 633, 638-40          |
| block size versus, 465                | defined, 648                             | software, 632                       |
| data cache, 519                       | first multiprocessor, CD7.14:3           | TFLOPS, CD7.14:5                    |
| defined, 454                          | Multiple instruction single data (MISD), | UMA, 639                            |
| global, 489                           | 649                                      | Multistage networks, 662            |
| improvement, 464                      | Multiple issue, 391–400                  | Multithreaded multiprocessor        |
| Intrinsity FastMATH processor, 470    | code scheduling, 396                     | architecture, A-25–36               |
| local, 489                            | defined, 391                             | conclusion, A-36                    |
| miss sources, 524                     | dynamic, 392, 397-400                    | ISA, A-31–34                        |
| split cache, 470                      | issue packets, 393                       | massive multithreading, A-25-26     |
| Miss under miss, 541                  | loop unrolling and, 397                  | multiprocessor, A-26–27             |
| Modules, B-4                          | processors, 391, 392                     | multiprocessor comparison,          |
| Moore machines, 532, C-68, C-71, C-72 | static, 392, 393–97                      | A-35–36                             |
| Moore's law, 654, A-72–73             | throughput and, 401                      | SIMT, A-27–30                       |
| Most significant bit                  | Multiplexors, C-10                       | special function units (SFUs), A-35 |
| 1-bit ALU for, C-33                   | controls, 531                            | streaming processor (SP), A-34      |
| defined, 88                           | in datapath, 320                         | thread instructions, A-30–31        |
| Motherboards, 17                      | defined, 302                             | threads/thread blocks management,   |
| Mouse anatomy, 16                     | forwarding, control values, 370          | A-30                                |
| Move instructions, B-70–73            | selector control, 314                    | Multithreading, A-25–26             |
| coprocessor, B-71-72                  | two-input, C-10                          | coarse-grained, 645–46              |
| details, B-70–73                      | Multiplicand, 230                        | defined, 634                        |
| floating-point, B-77-78               | Multiplication, 230–36                   | fine-grained, 645, 647              |
| MS-DOS, CD5.13:10–11                  | fast, hardware, 236                      | hardware, 645–48                    |
| Multicore multiprocessors, 41         | faster, 235                              | simultaneous (SMT), 646-48          |
| benchmarking with roofline model,     | first algorithm, 232                     | Must-information, CD2.15:14         |
| 675–84                                | floating-point, 255–58, B-78             | Mutual exclusion, 137               |
| characteristics, 677                  | hardware, 231–33                         |                                     |
| defined, 8, 632                       | instructions, 235, B-53-54               | N                                   |
| system organization, 676              | in MIPS, 235                             |                                     |
| two sockets, 676                      | multiplicand, 230                        | Name dependence, 397                |
| MULTICS (Multiplexed Information and  | multiplier, 230                          | NAND flash memory, CD6.14:4         |
| Computing Service), CD5.13:8–9        | operands, 230                            | NAND gates, C-8                     |

| NAS (NASA Advanced Supercomputing),  | Nonblocking caches, 403, 541          | format, B-13–14                        |
|--------------------------------------|---------------------------------------|----------------------------------------|
| 666                                  | Nonuniform memory access              | header, 141, B-13                      |
| N-body                               | (NUMA), 639                           | linking, 143–45                        |
| all-pairs algorithm, A-65            | Nonvolatile memory, 21                | relocation information, 141            |
| GPU simulation, A-71                 | Nonvolatile storage, 575              | static data segment, 141               |
| mathematics, A-65-67                 | Nops, 373                             | symbol table, 141, 142                 |
| multiple threads per body, A-68-69   | NOR flash memory, 581, CD6.14:4       | text segment, 141                      |
| optimization, A-67                   | NOR gates, C-8                        | Object-oriented languages              |
| performance comparison, A-69–70      | cross-coupled, C-50                   | brief history, CD2.20:7                |
| results, A-70–72                     | D latch implemented with, C-52        | defined, 161, CD2.15:14                |
| shared memory use, A-67-68           | NOR operation, 104–5, B-54, E-25      | See also Java                          |
| Negation instructions, B-54, B-78–79 | North bridge, 584                     | One's complement, 94, C-29             |
| Negation shortcut, 91–92             | NOT operation, 104, B-55, C-6         | Opcodes                                |
| Nested procedures, 116–18            | No write allocation, 467              | control line setting and, 323          |
| compiling recursive procedure        | Numbers                               | defined, 97, 319                       |
| showing, 117–18                      | binary, 87                            | OpenGL, A-13                           |
| defined, 116                         | computer versus real-world, 269       | OpenMP (Open MultiProcessing), 666     |
| Network of Workstations, CD7.14:7–8  | decimal, 87, 90                       | Open Systems Interconnect (OSI) model, |
| Networks, 24–25, 612–13, CD6.11:1–11 | denormalized, 270                     | CD6.11:2                               |
| advantages, 24                       | hexadecimal, 95–96                    | Operands, 80–87                        |
| bandwidth, 661                       | signed, 87–94                         | 32-bit immediate, 128–29               |
| characteristics, CD6.11:1            | unsigned, 87–94                       | adding, 225                            |
| crossbar, 662                        | NVIDIA GeForce 3, CDA.11:1            | arithmetic instructions, 80            |
| fully connected, 661, 662            | NVIDIA GeForce 8800, A-46-55,         | compiling assignment when in           |
| local area (LANs), 25, CD6.11:5-8,   | CDA.11:3                              | memory, 83                             |
| CD6.14:8                             | all-pairs N-body algorithm, A-71      | constant, 86–87                        |
| long-haul, CD6.11:5                  | dense linear algebra computations,    | division, 237                          |
| multistage, 662                      | A-51-53                               | floating-point, 260                    |
| OSI model layers, CD6.11:2           | FFT performance, A-53                 | memory, 82–83                          |
| peer-to-peer, CD6.11:2               | instruction set, A-49                 | MIPS, 78                               |
| performance, CD6.11:7-8              | performance, A-51                     | multiplication, 230                    |
| protocol families/suites, CD6.11:1   | rasterization, A-50                   | shifting, 164                          |
| switched, CD6.11:5                   | ROP, A-50-51                          | See also Instructions                  |
| wide area (WANs), 25, CD6.14:7-8     | scalability, A-51                     | Operating systems                      |
| Network topologies, 660–63           | sorting performance, A-54-55          | brief history, CD5.13:8-11             |
| implementing, 662-63                 | special function approximation        | defined, 10                            |
| multistage, 663                      | statistics, A-43                      | disk access scheduling pitfall, 616-17 |
| Newton's iteration, 266              | special function unit (SFU), A-50     | encapsulation, 21                      |
| Next state                           | streaming multiprocessor (SM),        | Operations                             |
| nonsequential, D-24                  | A-48-49                               | atomic, implementing, 138              |
| sequential, D-23                     | streaming processor, A-49-50          | hardware, 77–80                        |
| Next-state function, 531, C-67       | streaming processor array (SPA), A-46 | logical, 102–5                         |
| defined, 531                         | texture/processor cluster (TPC),      | x86 integer, 168–71                    |
| implementing, with sequencer,        | A-47-48                               | Optical disks                          |
| D-22-28                              | NVIDIA GPU architecture, 656-59       | defined, 23                            |
| Next-state outputs, D-10, D-12-13    |                                       | technology, 24                         |
| example, D-12–13                     | O                                     | Optimization                           |
| implementation, D-12                 |                                       | class explanation, CD2.15:13           |
| logic equations, D-12–13             | Object files, 141, B-4                | compiler, 160                          |
| truth tables, D-15                   | debugging information, 142            | control implementation, D-27-28        |
| Nonblocking assignment, C-24         | defined, B-10                         | global, CD2.15:4–6                     |

|                                   |                                      |                                      |
|-----------------------------------|--------------------------------------|--------------------------------------|
| Optimization (continued)          | Parallelism, 41, 391–403             | instructions, E-34–36                |
| high-level, CD2.15:3              | data-level, 649                      | load and clear instructions, E-36    |
| local, CD2.15:4–6, CD2.15:7       | debates, CD7.14:4-6                  | multiply/add and multiply/           |
| manual, 160                       | GPUs and, 655, A-76                  | subtract, E-36                       |
| OR operation, 104, B-55, C-6      | instruction-level, 41, 391, 402      | nullification, E-34                  |
| Out-of-order execution            | I/O and, 599–606                     | nullifying branch option, E-25       |
| defined, 400                      | job-level, 632                       | store bytes short, E-36              |
| performance complexity, 489       | memory hierarchies and, 534-38       | synthesized multiply and divide,     |
| processors, 403                   | multicore and, 648                   | E-34-35                              |
| Output devices, 15                | multiple issue, 391–400              | Parity, 602                          |
| Overflow                          | multithreading and, 648              | bit-interleaved, 602                 |
| defined, 89, 245                  | performance benefits, 43             | block-interleaved, 602–4             |
| detection, 226                    | process-level, 632                   | code, C-65                           |
| exceptions, 387                   | subword, E-17                        | disk, 603                            |
| floating point, 245               | task, A-24                           | distributed block-interleaved, 603-4 |
| occurrence, 90                    | thread, A-22                         | PARSEC (Princeton Application        |
| saturation and, 227–28            | Parallel memory system, A-36–41      | Repository for Shared-Memory         |
| subtraction, 226                  | caches, A-38                         | Computers), 666                      |
|                                   | constant memory, A-40                | Pass transistor, C-63                |
| P                                 | DRAM considerations, A-37–38         | PCI-Express (PCIe), A-8              |
|                                   | global memory, A-39                  | PC-relative addressing, 130, 133     |
| Packed floating-point format, 274 | load/store access, A-41              | Peak floating-point performance, 668 |
| Page faults, 498                  | local memory, A-40                   | Peak transfer rate, 617              |
| for data access, 513              | memory spaces, A-39                  | Peer-to-peer networks, CD6.11:2      |
| defined, 493, 494                 | MMU, A-38–39                         | Pentium bug morality play, 276–79    |
| handling, 495, 510–16             | ROP, A-41                            | Performance, 26–38                   |
| virtual address causing, 514      | shared memory, A-39–40               | assessing, 26–27                     |
| See also Virtual memory           | surfaces, A-41                       | classic CPU equation, 35–37          |
| Pages                             | texture memory, A-40                 | components, 37                       |
| defined, 493                      | See also Graphics processing units   | CPU, 30–32                           |
| dirty, 501                        | (GPUs)                               | defining, 27–30                      |
| finding, 496                      | Parallel-processing programs, 634–38 | equation, using, 34                  |
| LRU, 499                          | creation difficulty, 634–38          | improving, 32–33                     |
| offset, 494                       | defined, 632                         | instruction, 33–34                   |
| physical number, 494              | for message passing, 642–43          | measuring, 30–32, CD1.10:9           |
| placing, 496                      | for shared address space, 639–40     | networks, CD6.11:7–8                 |
| size, 495                         | use of, 686                          | program, 38                          |
| virtual number, 494               | Parallel reduction, A-62             | ratio, 30                            |
| See also Virtual memory           | Parallel scan, A-60–63               | relative, 29                         |
| Page tables, 520                  | CUDA template, A-61                  | response time, 28, 29                |
| defined, 496                      | defined, A-60                        | sorting, A-54–55                     |
| illustrated, 499                  | inclusive, A-60                      | throughput, 28                       |
| indexing, 497                     | tree-based, A-62                     | time measurement, 30                 |
| inverted, 500                     | Parallel software, 633               | Petabytes, 5                         |
| levels, 500–501                   | Paravirtualization, 547              | Physical addresses, 493              |
| main memory, 501                  | PA-RISC, E-14, E-17                  | defined, 492                         |
| register, 497                     | branch vectored, E-35                | mapping to, 494                      |
| storage reduction techniques,     | conditional branches, E-34, E-35     | space, 638, 640                      |
| 500–501                           | debug instructions, E-36             | Physically addressed caches, 508     |
| updating, 496                     | decimal operations, E-35             | Physical memory. See Main memory     |
| VMM, 529                          | extract and deposit, E-35            | Pipelined branches, 378              |

| Pipelined control, 359-63              | defined, 330                           | Polling, 589                        |
|----------------------------------------|----------------------------------------|-------------------------------------|
| control lines, 360, 361                | exceptions and, 386–91                 | Pop, 114                            |
| overview illustration, 375             | execution time and, 344                | Power                               |
| specifying, 361                        | fallacies, 407                         | clock rate and, 39                  |
| See also Control                       | hazards, 335–43                        | critical nature of, 55              |
| Pipelined datapaths, 344–58            | instruction set design for, 335        | efficiency, 402–3                   |
| with connected control signals, 362    | laundry analogy, 331                   | relative, 40                        |
| with control signals, 359              | overview, 330–44                       | PowerPC                             |
| corrected, 355                         | paradox, 331                           | algebraic right shift, E-33         |
| illustrated, 347                       | performance improvement, 335           | branch registers, E-32–33           |
| in load instruction stages, 355        | pitfall, 407–8                         | condition codes, E-12               |
| Pipelined dependencies, 364            | simultaneous executing instructions,   | instructions, E-12-13               |
| Pipeline registers                     | 344                                    | instructions unique to, E-31-33     |
| before forwarding, 368                 | speed-up formula, 333                  | load multiple/store multiple, E-33  |
| dependences, 366, 367                  | structural hazards, 335–36, 352        | logical shifted immediate, E-33     |
| forwarding unit selection, 371         | summary, 343                           | rotate with mask, E-33              |
| Pipelines                              | throughput and, 344                    | P + Q redundancy, 604               |
| AMD Opteron X4 (Barcelona), 404–6      | Pitfalls                               | Precise interrupts, 390             |
| branch instruction impact, 376         | address space extension, 545           | Prediction                          |
| effectiveness, improving, CD4.15:3–4   | associativity, 545                     | 2-bit scheme, 381                   |
| execute and address calculation stage, | defined, 51                            | accuracy, 380, 381                  |
| 350, 352                               | GPUs, A-74–75                          | dynamic branch, 380–83              |
| five-stage, 333, 348–50, 358           | ignoring memory system                 | loops and, 380                      |
| fixed-function graphics, CDA.11:1      | behavior, 544                          | steady-state, 380                   |
| graphic representation, 337,           | magnetic tape backups, 615–16          | Prefetching, 547, 680               |
| 356–58                                 | memory hierarchies, 543–47             | Primary memory. See Main memory     |
| instruction decode and register file   | moving functions to I/O                | Primitive types, CD2.15:25          |
| read stage, 348, 352                   | processor, 615                         | Priority levels, 590–92             |
| instruction fetch stage, 348, 352      | network feature provision, 614–15      | Procedure calls                     |
| instructions sequence, 372             | operating system disk accesses, 616–17 | convention, B-22–33                 |
| latency, 344                           | out-of-order processor                 | examples, B-27–33                   |
| memory access stage, 350, 352          | evaluation, 545                        | frame, B-23                         |
| multiple-clock-cycle diagrams, 356     | peak transfer rate performance, 617    | preservation across, 118            |
| performance bottlenecks, 402           | performance equation subset, 52–53     | Procedures, 112–22                  |
| single-clock-cycle diagrams, 356       | pipelining, 407–8                      | compiling, 114                      |
| stages, 333                            | pointer to automatic variables, 175    | compiling, showing nested procedure |
| static two-issue, 394                  | sequential word addresses, 175         | linking, 117–18                     |
| write-back stage, 350, 352             | simulating cache, 543–44               | defined, 112                        |
| Pipeline stalls, 338–39                | software development with              | execution steps, 112                |
| avoiding with code reordering,         | multiprocessors, 685                   | frames, 119                         |
| 338–39                                 | VMM implementation, 545–47             | leaf, 116                           |
| data hazards and, 371–74               | See also Fallacies                     | nested, 116–18                      |
| defined, 338                           | Pixel shader example, A-15–17          | recursive, 121, B-26–27             |
| insertion, 374                         | Pizza boxes, 607                       | for setting arrays to zero, 158     |
| load-use, 377                          | Pointers                               | sort, 150–55                        |
| as solution to control hazards, 340    | arrays versus, 157–61                  | strcpy, 124–25, 126                 |
| Pipelining, 330–44                     | frame, 119                             | string copy, 124–26                 |
| advanced, 402–3                        | global, 118                            | swap, 149–50                        |
| benefits, 331                          | incrementing, 159                      | Process identifiers, 510            |
| control hazards, 339–43                | Java, CD2.15:25                        | Process-level parallelism, 632      |
| data hazards, 336–39                   | stack, 114, 116                        | Processor-memory bus, 582           |

| Processors, 298–409                  | Program performance           | RAID. See Redundant arrays of           |
|--------------------------------------|-------------------------------|-----------------------------------------|
| control, 19                          | elements affecting, 38        | inexpensive disks                       |
| as cores, 41                         | understanding, 9              | RAMAC (Random Access Method             |
| datapath, 19                         | Programs                      | of Accounting and Control),             |
| defined, 14, 19                      | assembly language, 139        | CD6.14:1, CD6.14:2                      |
| dynamic multiple-issue, 392          | Java, starting, 146–48        | Rank units, 606, 607                    |
| I/O communication with, 589–90       | parallel-processing, 634–38   | Rasterization, A-50                     |
| multiple-issue, 391, 392             | starting, 139–48              | Raster operation (ROP) processors,      |
| out-of-order execution, 403, 489     | translating, 139–48           | A-12, A-41                              |
| performance growth, 42               | Propagate                     | fixed function, A-41                    |
| ROP, A-12, A-41                      | defined, C-40                 | GeForce 8800, A-50–51                   |
| speculation, 392–93                  | example, C-44                 | Raster refresh buffer, 17               |
| static multiple-issue, 392, 393–97   | super, C-41                   | Read-only memories (ROMs), C-14–16      |
| streaming, 657, A-34                 | Protected keywords, CD2.15:20 | control entries, D-16–17                |
| superscalar, 397, 398, 399–400, 646, | Protection                    | control function encoding, D-18–19      |
| CD4.15:4                             | defined, 492                  | defined, C-14                           |
| technologies for building, 25–26     | group, 602                    | dispatch, D-25                          |
| two-issue, 395                       | implementing, 508–10          | implementation, D-15–19                 |
| vector, 650–53                       | mechanisms, CD5.13:7          | logic function encoding, C-15           |
| VLIW, 394                            | VMs for, 526                  | overhead, D-18                          |
| Product, 230                         | Protocol families/suites      | PLAs and, C-15–16                       |
| Product of sums, C-11                | analogy, CD6.11:2–3           | programmable (PROM), C-14               |
| Program counters (PCs), 307          | defined, CD6.11:1             | total size, D-16                        |
| changing with conditional            | goal, CD6.11:2                | Read-stall cycles, 476                  |
| branch, 383                          | Protocol stacks, CD6.11:3     | Receive message routine, 641            |
| defined, 113, 307                    | Pseudodirect addressing, 133  | Receiver Control register, B-39         |
| exception, 509, 511                  | Pseudoinstructions            | Receiver Data register, B-38, B-39      |
| incrementing, 307, 309               | defined, 140                  | Recursive procedures, 121, B-26–27      |
| instruction updates, 348             | summary, 141                  | clone invocation, 116                   |
| Program libraries, B-4               | Pseudo MIPS                   | defined, B-26                           |
| Programmable array logic (PAL), C-78 | defined, 280                  | stack in, B-29–30                       |
| Programmable logic arrays (PLAs)     | instruction set, 281          | See also Procedures                     |
| component dots illustration, C-16    | Pthreads (POSIX threads), 666 | Reduced instruction set computer (RISC) |
| control function implementation,     | PTX instructions, A-31, A-32  | architectures, E-2–45, CD2.20:4,        |
| D-7, D-20–21                         | Public keywords, CD2.15:20    | CD4.15:3                                |
| defined, C-12                        | Push                          | group types, E-3–4                      |
| example, C-13–14                     | defined, 114                  | instruction set lineage, E-44           |
| illustrated, C-13                    | using, 116                    | See also Desktop and server RISCs;      |
| ROMs and, C-15–16                    |                               | Embedded RISCs                          |
| size, D-20                           |                               | Reduction, 640                          |
| truth table implementation, C-13     | Q                             | Redundant arrays of inexpensive disks   |
| Programmable logic devices (PLDs),   | Quad words, 168               | (RAID), 600–606                         |
| C-78                                 |                               | calculation of, 605                     |
| Programmable real-time graphics,     | Quicksort, 489, 490           | defined, 600                            |
| CDA.11:2–3                           | Quotient, 237                 |                                         |
|                                      |                               | example illustration, 601               |
| Programmable ROMs (PROMs), C-14      | R                             | history, CD6.14:6–7                     |
| Programming languages                |                               | PCI controller, 611                     |
| brief history of, CD2.20:6–7         | Race, C-73                    | popularity, 600                         |
| object-oriented, 161                 | Radix sort, 489, 490, A-63–65 | RAID 0, 601                             |
| variables, 81                        | CUDA code, A-64               | RAID 1, 602, CD6.14:6                   |
| See also specific languages          | implementation, A-63–65       | RAID 1 + 0, 606                         |

| RAID 2, 602, CD6.14:6<br>RAID 3, 602, CD6.14:6, CD6.14:7<br>RAID 4, 602–3, CD6.14:6<br>RAID 5, 603–4, CD6.14:6, CD6.14:7<br>RAID 6, 604<br>spread of, CD6.14:7<br>summary, 604–5<br>use statistics, CD6.14:7<br>Reference bit, 499<br>References<br>absolute, 142 | temporary, 81, 115<br>Transmitter Control, B-39–40<br>Transmitter Data, B-40<br>usage convention, B-24<br>use convention, B-22<br>variables, 81<br>x86, 168<br>Relational databases, CD6.14:5<br>Relative performance, 29<br>Relative power, 40<br>Reliability, 573 | Rotational latency, 576<br>Rounding<br>accurate, 266<br>bits, 268<br>defined, 266<br>with guard digits, 267<br>IEEE 754 modes, 268<br>Routers, CD6.11:6<br>Row-major order, 265<br>R-type instructions, 308–9<br>datapath for, 323 |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| forward, B-11                                                                                                                                                                                                                       | Relocation information, B-13, B-14                                                                                                                                                                                                    | datapath in operation for, 324                                                                                                                                                                       |
| types, CD2.15:25                                                                                                                                                                                                                    | Remainder                                                                                                                                                                                                                             |                                                                                                                                                                                                      |
| unresolved, B-4, B-18                                                                                                                                                                                                               | defined, 237                                                                                                                                                                                                                          | S                                                                                                                                                                                                    |
| Register addressing, 132, 133                                                                                                                                                                                                       | instructions, B-55                                                                                                                                                                                                                    |                                                                                                                                                                                                      |
| Register allocation, CD2.15:10–12                                                                                                                                                                                                   | Reorder buffers, 399, 402, 403                                                                                                                                                                                                        | Saturation, 227–28                                                                                                                                                                                   |
| Register files, C-50, C-54–56                                                                                                                                                                                                       | Replication, 536                                                                                                                                                                                                                      | Scalable GPUs, CDA.11:4–5                                                                                                                                                                            |
| in behavioral Verilog, C-57<br>defined, 308, C-50, C-54                                                                                                                                                                             | Requested word first, 465<br>Reservation stations                                                                                                                                                                                     | SCALAPAK, 271                                                                                                                                                                                        |
| single, 314                                                                                                                                                                                                                         | buffering operands in, 400                                                                                                                                                                                                            | Scaling                                                                                                                                                                                              |
| two read ports implementation,                                                                                                                                                                                                      | defined, 399                                                                                                                                                                                                                          | strong, 637, 638<br>weak, 637                                                                                                                                                                        |
| C-55                                                                                                                                                                                                                                | Response time, 28, 29                                                                                                                                                                                                                 | Scientific notation                                                                                                                                                                                  |
| with two read ports/one write port,                                                                                                                                                                                                 | Restartable instructions, 513                                                                                                                                                                                                         | adding numbers in, 250                                                                                                                                                                               |
| C-55                                                                                                                                                                                                                                | Restorations, 573                                                                                                                                                                                                                     | defined, 244                                                                                                                                                                                         |
| write port implementation, C-56                                                                                                                                                                                                     | Return address, 113                                                                                                                                                                                                                   | for reals, 244                                                                                                                                                                                       |
| Register-memory architecture, CD2.20:2                                                                                                                                                                                              | Return from exception (ERET), 509                                                                                                                                                                                                     | Secondary memory, 22                                                                                                                                                                                 |
| Registers                                                                                                                                                                                                                           | R-format, 319                                                                                                                                                                                                                         | Sectors, 575                                                                                                                                                                                         |
| architectural, 404                                                                                                                                                                                                                  | ALU operations, 310                                                                                                                                                                                                                   | Seek time, 575                                                                                                                                                                                       |
| base, 83                                                                                                                                                                                                                            | defined, 97                                                                                                                                                                                                                           | Segmentation, 495                                                                                                                                                                                    |
| callee-saved, B-23                                                                                                                                                                                                                  | Ripple carry                                                                                                                                                                                                                          | Selector values, C-10                                                                                                                                                                                |
| caller-saved, B-23                                                                                                                                                                                                                  | adder, C-29                                                                                                                                                                                                                           | Semiconductors, 45                                                                                                                                                                                   |
| Cause, 386, 590, 591, B-35                                                                                                                                                                                                          | carry lookahead speed versus, C-46                                                                                                                                                                                                    | Send message routine, 641                                                                                                                                                                            |
| clock cycle time and, 81                                                                                                                                                                                                            | RISC. See Desktop and server RISCs;                                                                                                                                                                                                   | Sensitivity list, C-24                                                                                                                                                                               |
| compiling C assignment with, 81–82                                                                                                                                                                                                  | Embedded RISCs; Reduced                                                                                                                                                                                                               | Sequencers                                                                                                                                                                                           |
| Count, B-34                                                                                                                                                                                                                         | instruction set computer (RISC)                                                                                                                                                                                                       | explicit, D-32                                                                                                                                                                                       |
| defined, 80                                                                                                                                                                                                                         | architectures                                                                                                                                                                                                                         | implementing next-state function                                                                                                                                                                     |
| destination, 98, 319                                                                                                                                                                                                                | Roofline model, 667–75                                                                                                                                                                                                                | with, D-22–28                                                                                                                                                                                        |
| floating-point, 265                                                                                                                                                                                                                 | benchmarking multicores with,                                                                                                                                                                                                         | Sequential logic, C-5                                                                                                                                                                                |
| left half, 348                                                                                                                                                                                                                      | 675–84                                                                                                                                                                                                                                | Servers                                                                                                                                                                                              |
| mapping, 94                                                                                                                                                                                                                         | with ceilings, 672, 674                                                                                                                                                                                                               | cost and capability, 5                                                                                                                                                                               |
| MIPS conventions, 121                                                                                                                                                                                                               | computational roofline, 673                                                                                                                                                                                                           | defined, 5                                                                                                                                                                                           |
| number specification, 309                                                                                                                                                                                                           | IBM Cell QS20, 678                                                                                                                                                                                                                    | See also Desktop and server RISCs                                                                                                                                                                    |
| page table, 497                                                                                                                                                                                                                     | illustrated, 669                                                                                                                                                                                                                      | Set-associative caches, 479–80                                                                                                                                                                       |
| pipeline, 366, 367, 368, 371                                                                                                                                                                                                        | Intel Xeon e5345, 678                                                                                                                                                                                                                 | address portions, 484                                                                                                                                                                                |
| primitives, 80–81                                                                                                                                                                                                                   | I/O intensive kernel, 675                                                                                                                                                                                                             | block replacement strategies,                                                                                                                                                                        |
| Receiver Control, B-39                                                                                                                                                                                                              | Opteron generations, 670                                                                                                                                                                                                              | 521                                                                                                                                                                                                  |
| Receiver Data, B-38, B-39                                                                                                                                                                                                           | with overlapping areas shaded, 674<br>peak floating-point performance, 668                                                                                                                                                            | choice of, 520                                                                                                                                                                                       |
| renaming, 397                                                                                                                                                                                                                       |                                                                                                                                                                                                                                       | defined, 479                                                                                                                                                                                         |
| right half, 348                                                                                                                                                                                                                     | peak memory performance, 669                                                                                                                                                                                                          | four-way, 481, 486                                                                                                                                                                                   |
| spilling, 86 | Sun UltraSPARC T2, 678 | memory-block location, 480 |
| Status, 386, 590, 591, B-35 | with two kernels, 674 | misses, 482–83 |


| Set-associative caches (continued)                    | SIMD (Single Instruction Multiple Data), | defined, 245                            |
|-------------------------------------------------------|------------------------------------------|-----------------------------------------|
| n-way, 479                                            | 649, 659                                 | See also Double precision               |
| two-way, 481                                          | computers, CD7.14:1–3                    | Single-program multiple data (SPMD),    |
| See also Caches                                       | data vector, A-35                        | 648, A-22                               |
| Set instructions, 109                                 | extensions, CD7.14:3                     | Small Computer Systems Interface (SCSI) |
| Setup time, C-53, C-54                                | for loops and, CD7.14:2                  | disks, 577, 613                         |
| Shaders, CDA.11:3                                     | massively parallel multiprocessors,      | Smalltalk                               |
| defined, A-14                                         | CD7.14:1                                 | Smalltalk-80, CD2.20:7                  |
| floating-point arithmetic, A-14                       | small-scale, CD7.14:3                    | SPARC support, E-30                     |
| graphics, A-14–15                                     | vector architecture, 650–53              | Snooping protocol, 536–37, 538          |
| pixel example, A-15–17                                | in x86, 649–50                           | Snoopy cache coherence, CD5.9:16        |
| Shading languages, A-14                               | SIMMs (single inline memory modules),    | Software                                |
| Shared memory                                         | CD5.13:4, CD5.13:5                       | GPU driver, 655                         |
| caching in, A-58–60                                   | Simple programmable logic devices        | layers, 10                              |
| CUDA, A-58                                            | (SPLDs), C-78                            | multiprocessor, 632                     |
| defined, A-21                                         | Simplicity, 176                          | parallel, 633                           |
| as low-latency memory, A-21                           | Simultaneous multithreading              | as service, 606, 686                    |
| N-body and, A-67–68                                   | (SMT), 646–48                            | systems, 10                             |
| per-CTA, A-39                                         | defined, 646                             | Sort algorithms, 157                    |
| SRAM banks, A-40                                      | support, 647                             | Sorting performance, A-54–55            |
| See also Memory                                       | thread-level parallelism, 647            | Sort procedure, 150–55                  |
| Shared-memory multiprocessors (SMP),                  | unused issue slots, 648                  | code for body, 151–53                   |
| 638–40                                                | Single-clock-cycle pipeline              | defined, 150                            |
| defined, 633, 638                                     | diagrams, 356                            | full procedure, 154–55                  |
| single physical address                               | defined, 356                             | passing parameters in, 154              |
| space, 638                                            | illustrated, 358                         | preserving registers in, 154            |
| synchronization, 639                                  | Single-cycle datapaths                   | procedure call, 153                     |
| Shift amount, 97                                      | illustrated, 345                         | register allocation for, 151            |
| Shift instructions, 102, B-55–56                      | instruction execution, 346               | See also Procedures                     |
| Signals                                               | See also Datapaths                       | Source files, B-4                       |
| asserted, 305, C-4                                    | Single-cycle implementation              | Source language, B-6                    |
| control, 306, 320, 321, 322                           | control function for, 327                | South bridge, 584                       |
| deasserted, 305, C-4                                  | defined, 327                             | Space allocation                        |
| Sign and magnitude, 245                               | nonpipelined execution versus            | on heap, 120–22                         |
| Sign bit, 90                                          | pipelined execution, 334                 | on stack, 119                           |
| Signed division, 239–41                               | non-use of, 328–30                       | SPARC                                   |
| Signed multiplication, 234                            | penalty, 330                             | annulling branch, E-23                  |
| Signed numbers, 87–94                                 | pipelined performance versus,            | CASA, E-31                              |
| sign and magnitude, 89                                | 332–33                                   | conditional branches, E-10–12           |
| treating as unsigned, 110                             | Single-instruction multiple-thread       | fast traps, E-30                        |
| Sign extension, 310                                   | (SIMT), A-27–30                          | floating-point operations, E-31         |
| defined, 124                                          | defined, A-27                            | instructions, E-29–32                   |
| shortcut, 92–93                                       | multithreaded warp scheduling, A-28      | least significant bits, E-31            |
| Significands, 246                                     | overhead, A-35                           | multiple precision floating-point       |
| addition, 250                                         | processor architecture, A-28             | results, E-32                           |
| multiplication, 255                                   | warp execution and divergence,           | nonfaulting loads, E-32                 |
| Silicon                                               | A-29–30                                  | overlapping integer operations, E-31    |
| crystal ingot, 45                                     | Single instruction single data           | quadruple precision floating-point      |
| defined, 45                                           | (SISD), 648                              | arithmetic, E-32                        |
| as key hardware technology, 54                        | Single precision                         | register windows, E-29–30               |
| wafers, 45                                            | binary representation, 248               | support for LISP and Smalltalk, E-30    |

| Sparse matrices, A-55–58              | Split caches, 470                  | Static multiple-issue processors, 392, |
|---------------------------------------|------------------------------------|----------------------------------------|
| Sparse Matrix-Vector multiply (SpMV), | Square root instructions, B-79     | 393–97                                 |
| 679–80, 681, A-55,                    | Stack architectures, CD2.20:3      | control hazards and, 394               |
| A-57, A-58                            | Stack pointers                     | instruction sets, 393                  |
| CUDA version, A-57                    | adjustment, 116                    | with MIPS ISA, 394–97                  |
| serial code, A-57                     | defined, 114                       | See also Multiple issue                |
| shared memory version, A-59           | values, 116                        | Static random access memories (SRAMs), |
| Spatial locality, 452–53              | Stacks                             | C-58–62                                |
| defined, 452                          | allocating space on, 119           | array organization, C-62               |
| large block exploitation of, 464      | for arguments, 156                 | basic structure, C-61                  |
| tendency, 456                         | defined, 114                       | defined, 20, C-58                      |
| SPEC, CD1.10:10–11                    | pop, 114                           | fixed access time, C-58                |
| CPU benchmark, 48–49                  | push, 114, 116                     | large, C-59                            |
| defined, CD1.10:10                    | recursive procedures, B-29–30      | read/write initiation, C-59            |
| power benchmark, 49–50                | Stack segment, B-22                | synchronous (SSRAMs), C-60             |
| SPEC89, CD1.10:10                     | Stalls, 338–39                     | three-state buffers, C-59, C-60        |
| SPEC92, CD1.10:11                     | avoiding with code reordering,     | Static variables, 118                  |
| SPEC95, CD1.10:11                     | 338–39                             | Status register, 590                   |
| SPEC2000, CD1.10:11                   | behavioral Verilog with detection, | fields, B-34, B-35                     |
| SPEC2006, 282, CD1.10:11              | CD4.12:5–9                         | illustrated, 591                       |
| SPECPower, 597                        | data hazards and, 371–74           | Steady-state prediction, 380           |
| SPECrate, 664                         | defined, 338                       | Sticky bits, 268                       |
| SPECratio, 48                         | illustrations, CD4.12:25,          | Storage                                |
| Special function units (SFUs), A-35   | CD4.12:28–30                       | disk, 575–79                           |
| defined, A-43                         | insertion into pipeline, 374       | flash, 580–82                          |
| GeForce 8800, A-50                    | load-use, 377                      | nonvolatile, 575                       |
| Speculation, 392–93                   | memory, 478                        | Storage area networks (SANs),          |
| defined, 392                          | as solution to control hazard, 340 | CD6.11:11                              |
| hardware-based, 400                   | write-back scheme, 476             | Store buffers, 403                     |
| implementation, 392                   | write buffer, 476                  | Stored program concept, 77             |
| performance and, 393                  | Standby spares, 605                | as computer principle, 100             |
| problems, 393                         | State                              | illustrated, 101                       |
| recovery mechanism, 393               | in 2-bit prediction scheme, 381    | principles, 176                        |
| Speed-up challenge, 635–38            | assignment, C-70, D-27             | Store instructions                     |
| balancing load, 637–38                | bits, D-8                          | access, A-41                           |
| bigger problem, 636–37                | exception, saving/restoring, 515   | base register, 319                     |
| Spilling registers, 86, 115           | logic components, 305              | block, 165                             |
| SPIM, B-40–45                         | specification of, 496              | compiling with, 85                     |
| byte order, B-43                      | State elements                     | conditional, 138–39                    |
| defined, B-40                         | clock and, 306                     | defined, 85                            |
| features, B-42–43                     | combinational logic and, 306       | details, B-68–70                       |
| getting started with, B-42            | defined, 305, C-48                 | EX stage, 353                          |
| MIPS assembler directives support,    | inputs, 305                        | floating-point, B-79                   |
| B-47–49                               | register file, C-50                | ID stage, 349                          |
| speed, B-41                           | in storing/accessing instructions, | IF stage, 349                          |
| system calls, B-43–45                 | 308                                | instruction dependency, 371            |
| versions, B-42                        | Static branch prediction, 393      | list of, B-68–70                       |
| virtual machine simulation, B-41–42   | Static data                        | MEM stage, 354                         |
| SPLASH/SPLASH 2 (Stanford Parallel    | defined, B-20                      | unit for implementing, 311             |
| Applications for Shared-Memory),      | as dynamic data, B-21              | WB stage, 354                          |
| 664–66                                | segment, 120                       | See also Load instructions             |


| Store word, 85                         | Supercomputers, 5, CD4.15:1  | T                                   |
|----------------------------------------|------------------------------|-------------------------------------|
| defined, 124                           | 398, 399–400                 | page tables and, 498                |
| as leaf procedure, 126                 | multithreading options, 646  | size of, 486–87                     |
| pointers, 126                          | Surfaces, A-41               | Tail call, 121                      |
| See also Procedures                    | Swap procedure, 149–50       | Task identifiers, 510               |
| Stream benchmark, 675                  | body code, 150               | Task parallelism, A-24              |
| Streaming multiprocessor (SM), A-48–49 | defined, 149                 | TCP/IP packet format, CD6.11:4      |
| Streaming processors, 657, A-34        | full, 150, 151               | Tesla PTX ISA, A-31–34              |
| array (SPA), A-41, A-46                | See also Procedures          | arithmetic instructions, A-35       |
| GeForce 8800, A-49–50                  | Swap space, 498              | GPU thread instructions, A-33       |
| Streaming SIMD Extension 2 (SSE2)      | Switched networks, CD6.11:5  | memory access instructions, A-33–34 |
| floating-point architecture, 274–75    | Switches, CD6.11:6–7         | Temporal locality, 453              |
| Stretch computer, CD4.15:1             | Synchronization, 137–39      | defined, 452                        |
| Strings                                | barrier, A-18, A-20, A-34    | tendency, 456                       |
| defined, 124                           | defined, 639                 | Temporary registers, 81, 115        |
| Striping, 601                          |                              |                                     |
| Strong scaling, 637, 638               |                              |                                     |
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                             | in locating block, 484           |
| Streaming multiprocessors (SM), A-48–49         Surfaces, A-41         Tail call, 121           Streaming processors, 657, A-34         Swap procedure, 149–50         Task identifiers, 510           array (SPA), A-41, A-46         body code, 150         Task parallelism, A-24           GeForce 8800, A-9–50         defined, 149         TCP/IP packet format, CD6.11-4           Streaming SIMD Extension 2 (SSE2)         full, 150, 151         Teles PTX ISA, A-31-34           floating-point architecture, 274–75         See also Procedures         Tail call, 121           Strings         Switches, C96.11-5         Teles PTX ISA, A-31-34           Strings         Switches, CD6.11-5         Mary Space, 498         GPU thread instructions, A-33           Strings, 601         Switches, CD6.11-5         Mary Space, 498         Temporal locality, 453         GPU thread instructions, A-32           Strings, 601         Synchronization, 137–39         defined, 452         tendency, 456         Temporal locality, 455         defined, 452         tendency, 456         Temporal locality, 455         defined, 452         tendency, 456         Temporary registers, 81, 115         Terabytes, 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                             |                                  |
| Streaming processors, 657, A-34         Swap procedure, 149–50         Task identifiers, 510           GeForce 8800, A-49–50         defined, 149         TCP//P packet format, CD6.11:4           GeForce 8800, A-49–50         (SEE2)         full, 150, 151         Telsa PTX ISA, A-31–34           Streaming SIMD Extension 2 (SSE2)         full, 150, 151         Telsa PTX ISA, A-31–34           gland point architecture, 274–75         See also Procedures         barrier synchronization, A-33           Stretch computer, CD4.15:1         Swap space, 498         GPU thread instructions, A-32           Strings         Switched networks, CD6.11:6–7         memory sees instructions, A-32           feffined, 124         Switches, CD6.11:6–7         Tenderical, 34           striping, 601         barrier, A-18, A-20, A-34         tendency, 456           Strong scaling, 637, 638         defined, 639         Temporary registers, 81, 115           Structural hazards, 335–36, 552         lock, 137         Terabytes, 5           Structural particular particul                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                             | size of, 486–87                  |
| array (SPA), A.41, A.46         body code, 150         Task parallelism, A.24           GeForce 8800, A-49-50         defined, 149         TCP/IP packet format, CD6.11:4           Streaming SIMD Extension 2 (SSE2)         full, 150, 151         Telsa PTX ISA, A.31-34           floating-point architecture, 274-75         See also Procedures         barrier synchronization, A.34           Strings         Switched networks, CD6.11:5         barrier synchronization, A.34           Gefined, 124         Switches, CD6.11:6-7         memory access instructions, A.33           in Java, 126-27         Symbol tables, 141, B-12, B-13         Temporal locality, 453           representation, 124         Synchronization, 137-39         defined, 452           Striping, 601         barrier, A-18, A-20, A-34         tendency, 456           Stroug scaling, 637, 638         defined, 639         Temporary registers, 81, 115           Structured Query Language (SQL),         overhead, reducing, 43         Text segment, B-13           Subnormals, 270         Synchronizers         Text segment, B-13           Subtracks, 606         defined, C-76         Text segment, B-13           Subtraction, 224-29         from D flip-flop, C-76         Text segment, B-13           Binary, 224-25         failure, C-77         Texture processor cluster (TPC),           Sub rasic                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | Tail call, 121                   |
| Streaming SIMD Extension 2 (SSE2)   full, 150, 151   Telsa PTX ISA, A-31-34   arithmetic instructions, A-33   arithmetic instructions, A-34   Streaming SIMD Extension 2 (SSE2)   full, 150, 151   Telsa PTX ISA, A-31-34   arithmetic instructions, A-33   arithmetic instructions, A-33   Streaming SIMD Extension 2 (SSE2)   Swap space, 498   GPU thread instructions, A-34   GPU thread instructions, A-32   Strings   Switches, CD6.11:5   Swap space, 498   GPU thread instructions, A-32   memory access instructions, A-32   memory space   memory, 51   memory access instructions, A-32   memory access instructions, A-32   memory space   memory, A-40    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                             | Task identifiers, 510            |
| Streaming SIMD Extension 2 (SSE2) floating-point architecture, 274-75         feating-point architecture, register allocation, 149-50 arithmetic instructions, A-34         Telsa PTX ISA, A-31-34 arithmetic instructions, A-34 barrier synchronization, A-34           Stretch computer, CD4.15:1         Swap space, 498 Sovitched networks, CD6.11:5 switches, CD6.11:6-7 in Java, 126-27 switches, CD6.11:6-7 switches, CD6.11:6-7 in Java, 126-27 symbol tables, 141, B-12, B-13 representation, 124 synchronization, 137-39 defined, 452 defined, 639         Temporal locality, 453 defined, 452 defined, 452 defined, 639           Strog scaling, 637, 638         defined, 639         Temporary registers, 81, 115 reabytes, 5           Structural hazards, 335-36, 532         lock, 137         Terabytes, 5           Structural pactural hazards, 335-36, 532         lock, 137         Terabytes, 5           Structural pactural pacture in pacture                                                                                                                                                                                                                                                    | •                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                             |                                  |
| floating-point architecture,   register allocation, 149–50   Ser also Procedures   Ser also Procedures   Ser also Procedures   Subarrier synchronization, A-34   Stretch computer, CD4.15:1   Swap space, 498   GPU thread instructions, A-32   Strings   Switches, CD6.11:5   memory access instructions, and defined, 124   Switches, CD6.11:6-7   A-33–34   Temporal locality, 453   defined, 124   Synchronization, 137–39   defined, 452   Striping, 601   barrier, A-18, A-20, A-34   tendency, 456   tendency, 456   Strong scaling, 637, 638   defined, 639   Temporary registers, 81, 115   Terabytes, 5   Structured Query Language (SQL), overhead, reducing, 43   Test amultiprocessor, 658   CD6.14:5   unlock, 137   Test segment, B-13   Test segment, B-13   Subnormals, 270   Synchronizers   Texture memory, A-40   Subtracts, 606   defined, C-76   Texture/processor cluster (TPC), Subtraction, 224–29   from D flip-flop, C-76   A-47–48   Sinary, 224–25   floating-point, 259, B-79–80   Synchronous bus, 583   Thrashing, 517   Thread blocks, 659   creation, A-23   defined, A-19   See also Arithmetic   C-60   managing, A-30   memory barring, A-20   Synchronous SRAM (SSRAM), defined, A-19   C-60   managing, A-30   memory sharing, A-20   Synchronous System, C-48   memory sharing, A-20   Thread parallelism, A-22   Sun Fire x4150 server, 606–12   System calls, B-43–45   Thread dispatch, 659   Thread idle and peak power, 612   defined, 509   Thread parallelism, A-22   System Performance Evaluation   Minimum memory, 611   Cooperative, See SPEC   SA, A-31–34   memory sharing, A-20   memory s   |                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |                             | TCP/IP packet format, CD6.11:4   |
| 274-75         See also Procedures         barrier synchronization, A-34           Stretch computer, CD4.15:1         Sway space, 498         GPU thread instructions, A-32           Strings         Switches (D6.11:6-7)         A-33-34           defined, 124         Switches, CD6.11:6-7         A-33-34           in Java, 126-27         Symbol tables, 141, B-12, B-13         Temporal locality, 453           representation, 124         Synchronization, 137-39         defined, 452           Striping, 601         barrier, A-18, A-20, A-34         tendency, 456           Stroug scaling, 637, 638         defined, 639         Temporary registers, 81, 115           Structural Azards, 335-36, 352         lock, 137         Terabytes, 5           Structural Aguards, 335-36, 352         lock, 137         Terabytes, 5           Structural Aguards, 335-36, 352         lock, 137         Text segment, B-13           Subnormals, 270         Synchronizers         Texture remory, 4-40           Subtracks, 606         defined, C-76         Texture remory, A-40           Subtracks, 606         defined, C-76         Texture remory, A-40           Subtracks, 606         defined, C-77         Texture/processor, CD7.14:5           floating-point, 259, B-79-80         Synchronous bus, 583         Thread blocks, 659 <th< td=""><td></td><td></td><td>Telsa PTX ISA, A-31–34</td></th<>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | Telsa PTX ISA, A-31–34           |
| Stretch computer, CD4.15:1Swap space, 498GPU thread instructions, A-32StringsSwitched networks, CD6.11:5memory access instructions,defined, 124Switches, CD6.11:6-7A-33-34in Java, 126-27Symbol tables, 141, B-12, B-13Temporal locality, 453Striping, 601barrier, A-18, A-20, A-34tendency, 456Strong scaling, 637, 638defined, 639Temporary registers, 81, 115Structural hazards, 335-36, 352lock, 137Terabytes, 5Structured Query Language (SQL),overhead, reducing, 43Tesla multiprocessor, 658CD6.14:5unlock, 137Text segment, B-13Subracks, 606defined, C-76Text ture processor cluster (TPC),Subtracks, 606defined, C-76Text ture processor cluster (TPC),Subtraction, 224-29from D flip-flop, C-76A-47-48binary, 224-25failure, C-77TFLOPS multiprocessor, CD7.14:5floating-point, 259, B-79-80Synchronous DRAM (SRAM),<br>negative number, 226Thrashing, 517overflow, 226Synchronous SRAM (SRAM),<br>See also ArithmeticC-60managing, A-30Sun Fire x4150 server, 606-12Synchronous system, C-48memory sharing, A-20Sun Fire x4150 server, 606-12System calls, B-43-45Thread dispatch, 659front/rear illustration, 608code, B-43-44CUDA, A-36ide and peak power, 612defined, 509Threadlogical connections and bandwidths,<br>609System Performance Evaluation<br>minimum memory, 611Cooperative. 
See SPECISA, A-31-34Sun UltrasP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | ~ -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | •                           | arithmetic instructions, A-33    |
| Strings<br>defined, 124<br>in Java, 126–27<br>representation, 124<br>Striping, 601<br>Strong scaling, 637, 638<br>Structural hazards, 335–36, 352<br>Structured Query Language (SQL), CD6.14:5<br>Subnormals, 270<br>Subtracks, 606<br>Subtraction, 224–29<br>binary, 224–25<br>floating-point, 259, B-79–80<br>instructions, B-56–57<br>negative number, 226<br>overflow, 226<br>See also Arithmetic<br>Subword parallelism, E-17<br>Sum of products, C-11, C-12<br>Sun Fire x4150 server, 606–12<br>front/rear illustration, 608<br>idle and peak power, 612<br>logical connections and bandwidths, 609<br>minimum memory, 611<br>Sun UltraSPARC T2 (Niagara 2), 647, 658<br>base versus fully optimized performance, 683<br>characteristics, 677<br>defined, 677<br>illustrated, 676<br>LBMHD performance, 682<br>roofline model, 678 | Switched networks, CD6.11:5<br>Switches, CD6.11:6–7<br>Symbol tables, 141, B-12, B-13<br>Synchronization, 137–39<br>barrier, A-18, A-20, A-34<br>defined, 639<br>lock, 137<br>overhead, reducing, 43<br>unlock, 137<br>Synchronizers<br>defined, C-76<br>from D flip-flop, C-76<br>failure, C-77<br>Synchronous bus, 583<br>Synchronous DRAM (SDRAM), 473, C-60, C-65<br>Synchronous SRAM (SSRAM), C-60<br>Synchronous system, C-48<br>Syntax tree, CD2.15:3<br>System calls, B-43–45<br>code, B-43–44<br>defined, 509<br>loading, B-43<br>System Performance Evaluation Cooperative. See SPEC<br>Systems software, 10<br>SystemVerilog<br>cache controller, CD5.9:1–9<br>cache data and tag modules, CD5.9:5<br>FSM, CD5.9:6–9<br>simple cache block diagram, CD5.9:3<br>type declarations, CD5.9:1, | GPU thread instructions, A-32<br>memory access instructions, A-33–34<br>Temporal locality, 453<br>defined, 452<br>tendency, 456<br>Temporary registers, 81, 115<br>Terabytes, 5<br>Tesla multiprocessor, 658<br>Text segment, B-13<br>Texture memory, A-40<br>Texture/processor cluster (TPC), A-47–48<br>TFLOPS multiprocessor, CD7.14:5<br>Thrashing, 517<br>Thread blocks, 659<br>creation, A-23<br>defined, A-19<br>managing, A-30<br>memory sharing, A-20<br>synchronization, A-20<br>Thread dispatch, 659<br>Thread parallelism, A-22<br>Threads<br>creation, A-23<br>CUDA, A-36<br>ISA, A-31–34<br>managing, A-30<br>memory latencies and, A-74–75<br>multiple, per body, A-68–69<br>warps, A-27<br>Three Cs model, 523<br>Three-state buffers, C-59, C-60<br>Throughput<br>defined, 28<br>multiple issue and, 401 |
| CD6.14:5<br>Subnormals, 270<br>Subtracks, 606<br>Subtraction, 224–29<br>binary, 224–25<br>floating-point, 259, B-79–80<br>instructions, B-56–57<br>negative number, 226<br>overflow, 226<br>See also Arithmetic<br>Subword parallelism, E-17<br>Sum of products, C-11, C-12<br>Sun Fire x4150 server, 606–12<br>front/rear illustration, 608<br>idle and peak power, 612<br>logical connections and bandwidths, 609<br>minimum memory, 611<br>Sun UltraSPARC T2 (Niagara 2), 647, 658<br>base versus fully optimized performance, 683<br>characteristics, 677<br>defined, 677<br>illustrated, 676<br>LBMHD performance, 682<br>roofline model, 678 | Synchronizers<br>defined, C-76<br>from D flip-flop, C-76<br>failure, C-77<br>Synchronous bus, 583<br>Synchronous DRAM (SDRAM), 473, C-60, C-65<br>Synchronous SRAM (SSRAM), C-60<br>Synchronous system, C-48<br>Syntax tree, CD2.15:3<br>System calls, B-43–45<br>code, B-43–44<br>defined, 509<br>loading, B-43<br>System Performance Evaluation Cooperative. See SPEC<br>Systems software, 10<br>SystemVerilog<br>cache controller, CD5.9:1–9<br>cache data and tag modules, CD5.9:5<br>FSM, CD5.9:6–9<br>simple cache block diagram, CD5.9:3<br>type declarations, CD5.9:1, | | Tesla multiprocessor, 658<br>Texture/processor cluster (TPC), A-47–48<br>TFLOPS multiprocessor, CD7.14:5<br>Thrashing, 517<br>Thread blocks, 659<br>creation, A-23<br>defined, A-19<br>managing, A-30<br>memory sharing, A-20<br>synchronization, A-20<br>Thread dispatch, 659<br>Thread parallelism, A-22<br>Threads<br>creation, A-23<br>CUDA, A-36<br>ISA, A-31–34<br>managing, A-30<br>memory latencies and, A-74–75<br>multiple, per body, A-68–69<br>warps, A-27<br>Three Cs model, 523<br>Three-state buffers, C-59, C-60<br>Throughput<br>defined, 28<br>multiple issue and, 401 |
| overflow, 226<br>See also Arithmetic<br>Subword parallelism, E-17<br>Sum of products, C-11, C-12<br>Sun Fire x4150 server, 606–12<br>front/rear illustration, 608<br>idle and peak power, 612<br>logical connections and bandwidths, 609<br>minimum memory, 611<br>Sun UltraSPARC T2 (Niagara 2), 647, 658<br>base versus fully optimized performance, 683<br>characteristics, 677<br>defined, 677<br>illustrated, 676<br>LBMHD performance, 682<br>roofline model, 678<br>Synchronous SRAM (SSRAM), C-60<br>Synchronous system, C-48<br>Syntax tree, CD2.15:3<br>System calls, B-43–45<br>code, B-43–44<br>defined, 509<br>loading, B-43<br>System Performance Evaluation Cooperative. See SPEC<br>Systems software, 10<br>SystemVerilog<br>cache controller, CD5.9:1–9<br>cache data and tag modules, CD5.9:5<br>FSM, CD5.9:6–9<br>simple cache block diagram, CD5.9:3<br>type declarations, CD5.9:1,<br>defined, A-19<br>managing, A-30<br>memory sharing, A-20<br>synchronization, A-20<br>Thread dispatch, 659<br>Thread parallelism, A-22<br>Threads<br>creation, A-23<br>CUDA, A-36<br>ISA, A-31–34<br>managing, A-30<br>memory latencies and, A-74–75<br>multiple, per body, A-68–69<br>warps, A-27<br>Three Cs model, 523<br>Three-state buffers, C-59, C-60<br>Throughput<br>defined, 28<br>multiple issue and, 401
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  | •                           | Thread blocks, 659               |
| See also ArithmeticC-60managing, A-30Subword parallelism, E-17Synchronous system, C-48memory sharing, A-20Sum of products, C-11, C-12Syntax tree, CD2.15:3synchronization, A-20Sun Fire x4150 server, 606–12System calls, B-43-45Thread dispatch, 659front/rear illustration, 608code, B-43-44Thread parallelism, A-22idle and peak power, 612defined, 509Threadslogical connections and bandwidths,loading, B-43creation, A-23609System Performance EvaluationCUDA, A-36minimum memory, 611Cooperative. See SPECISA, A-31-34Sun UltraSPARC T2 (Niagara 2),Systems software, 10managing, A-30647, 658SystemVerilogmemory latencies and, A-74-75base versus fully optimizedcache controller, CD5.9:1-9multiple, per body, A-68-69performance, 683cache data and tag modules,warps, A-27characteristics, 677CD5.9:5Three Cs model, 523defined, 677FSM, CD5.9:6-9Three-state buffers, C-59, C-60illustrated, 676simple cache block diagram,ThroughputLBMHD performance, 682CD5.9:3defined, 28roofline model, 678type declarations, CD5.9:1,multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 | e                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                  |                             | creation, A-23                   |
| Subword parallelism, E-17 Synchronous system, C-48 Sum of products, C-11, C-12 Syntax tree, CD2.15:3 Synchronization, A-20 Sun Fire x4150 server, 606–12 System calls, B-43–45 Inread dispatch, 659 front/rear illustration, 608 code, B-43–44 Inread parallelism, A-22 idle and peak power, 612 logical connections and bandwidths, 609 System Performance Evaluation GUDA, A-36 minimum memory, 611 Cooperative. See SPEC ISA, A-31–34 Sun UltraSPARC T2 (Niagara 2), Systems software, 10 managing, A-30 647, 658 SystemVerilog memory latencies and, A-74–75 base versus fully optimized cache controller, CD5.9:1–9 performance, 683 cache data and tag modules, characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | defined, A-19                    |
| Sum of products, C-11, C-12Syntax tree, CD2.15:3synchronization, A-20Sun Fire x4150 server, 606-12System calls, B-43-45Thread dispatch, 659front/rear illustration, 608code, B-43-44Thread parallelism, A-22idle and peak power, 612defined, 509Threadslogical connections and bandwidths,loading, B-43creation, A-23609System Performance EvaluationCUDA, A-36minimum memory, 611Cooperative. See SPECISA, A-31-34Sun UltraSPARC T2 (Niagara 2),Systems software, 10managing, A-30647, 658SystemVerilogmemory latencies and, A-74-75base versus fully optimizedcache controller, CD5.9:1-9multiple, per body, A-68-69performance, 683cache data and tag modules,warps, A-27characteristics, 677CD5.9:5Three Cs model, 523defined, 677FSM, CD5.9:6-9Three-state buffers, C-59, C-60illustrated, 676simple cache block diagram,ThroughputLBMHD performance, 682CD5.9:3defined, 28roofline model, 678type declarations, CD5.9:1,multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | managing, A-30                   |
| Sun Fire x4150 server, 606–12 front/rear illustration, 608 idle and peak power, 612 logical connections and bandwidths, 609 system Performance Evaluation minimum memory, 611 Cooperative. See SPEC System Verilog base versus fully optimized cache controller, CD5.9:1–9 performance, 683 cache data and tag modules, characteristics, 677 characteristics, 677 clibrary CD5.9:5 logical connections and bandwidths, loading, B-43 Creation, A-23 CUDA, A-36 CUDA, A-36 ISA, A-31–34 Sun UltraSPARC T2 (Niagara 2), Systems software, 10 managing, A-30 memory latencies and, A-74–75 multiple, per body, A-68–69 multiple, per body, A-68–69 marps, A-27 characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 illustrated, 676 simple cache block diagram, LBMHD performance, 682 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             |                                  |
| front/rear illustration, 608 idle and peak power, 612 logical connections and bandwidths, 609 System Performance Evaluation minimum memory, 611 Cooperative. See SPEC System Software, 10 System Verilog base versus fully optimized performance, 683 cache data and tag modules, characteristics, 677 characteristics, 677 cliber of CD5.9:5 defined, 676 illustrated, 676 illustrated, 676 illustrated, 676 code, B-43-44 Thread parallelism, A-22 Threads CCUDA, A-36 CUDA, A-36 CUDA, A-36 System Verion CUDA, A-36 ISA, A-31-34 ISA, A-3 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                             | synchronization, A-20            |
| idle and peak power, 612 defined, 509 Threads logical connections and bandwidths, 609 System Performance Evaluation CUDA, A-36 minimum memory, 611 Cooperative. See SPEC ISA, A-31–34 Sun UltraSPARC T2 (Niagara 2), Systems software, 10 managing, A-30 647, 658 SystemVerilog memory latencies and, A-74–75 base versus fully optimized cache controller, CD5.9:1–9 multiple, per body, A-68–69 performance, 683 cache data and tag modules, characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  | •                           |                                  |
| logical connections and bandwidths, 609 System Performance Evaluation CUDA, A-36 minimum memory, 611 Cooperative. See SPEC ISA, A-31–34 Sun UltraSPARC T2 (Niagara 2), Systems software, 10 managing, A-30 647, 658 SystemVerilog memory latencies and, A-74–75 base versus fully optimized cache controller, CD5.9:1–9 multiple, per body, A-68–69 performance, 683 cache data and tag modules, characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | Thread parallelism, A-22         |
| 609 System Performance Evaluation CUDA, A-36 minimum memory, 611 Cooperative. See SPEC ISA, A-31–34 Sun UltraSPARC T2 (Niagara 2), Systems software, 10 managing, A-30 647, 658 SystemVerilog memory latencies and, A-74–75 base versus fully optimized cache controller, CD5.9:1–9 multiple, per body, A-68–69 performance, 683 cache data and tag modules, warps, A-27 characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, Throughput LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | Threads                          |
| minimum memory, 611 Cooperative. See SPEC ISA, A-31–34  Sun UltraSPARC T2 (Niagara 2), Systems software, 10 managing, A-30 647, 658 SystemVerilog memory latencies and, A-74–75  base versus fully optimized cache controller, CD5.9:1–9 multiple, per body, A-68–69 performance, 683 cache data and tag modules, characteristics, 677 CD5.9:5 Three Cs model, 523  defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | creation, A-23                   |
| Sun UltraSPARC T2 (Niagara 2), Systems software, 10 managing, A-30 647, 658 SystemVerilog memory latencies and, A-74–75 base versus fully optimized cache controller, CD5.9:1–9 multiple, per body, A-68–69 performance, 683 cache data and tag modules, characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | CUDA, A-36                       |
| 647, 658 SystemVerilog memory latencies and, A-74–75 base versus fully optimized cache controller, CD5.9:1–9 multiple, per body, A-68–69 performance, 683 cache data and tag modules, warps, A-27 characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | ISA, A-31–34                     |
| base versus fully optimized performance, 683 cache data and tag modules, characteristics, 677 characteristics, 677 defined, 677 FSM, CD5.9:6–9 illustrated, 676 illustrated, 676 LBMHD performance, 682 roofline model, 678 cache data and tag modules, warps, A-27 Three Cs model, 523 Three-state buffers, C-59, C-60 Three-state buffers, C-59, C-60 Throughput defined, 28 multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  | ,                           |                                  |
| performance, 683 cache data and tag modules, warps, A-27 characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             |                                  |
| characteristics, 677 CD5.9:5 Three Cs model, 523 defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             | multiple, per body, A-68–69      |
| defined, 677 FSM, CD5.9:6–9 Three-state buffers, C-59, C-60 illustrated, 676 simple cache block diagram, Throughput LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  | _                           |                                  |
| illustrated, 676 simple cache block diagram, Throughput LBMHD performance, 682 CD5.9:3 defined, 28 roofline model, 678 type declarations, CD5.9:1, multiple issue and, 401                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                  |                             |                                  |
| LBMHD performance, 682 | CD5.9:3                    | defined, 28                     |
| roofline model, 678    | type declarations, CD5.9:1 | multiple issue and, 401         |
|                        |                            | Three-state buffers, C-59, C-60 |
| SpMV performance, 681  | CD5.9:2                    | pipelining and, 344, 401        |

| Thumb, E-15, E-38                     | Two-phase clocking, C-75             | V                                      |
|---------------------------------------|--------------------------------------|----------------------------------------|
| Timing                                | Two's complement representation,     |                                        |
| asynchronous inputs, C-76–77          | 89, 90                               | Vacuum tubes, 26                       |
| level-sensitive, C-75–76              | advantage, 90                        | Valid bit, 458                         |
| methodologies, C-72–77                | defined, 89                          | Variables                              |
| two-phase, C-75                       | negation shortcut, 91–92             | C language, 118                        |
| TLB misses, 503                       | rule, 93                             | programming language, 81               |
| entry point, 514                      | sign extension shortcut, 92–93       | register, 81                           |
| handler, 514                          | TX-2 computer, CD7.14:3              | static, 118                            |
| handling, 510–16                      |                                      | storage class, 118                     |
| minimization, 681                     | U                                    | type, 118                              |
| occurrence, 510                       |                                      | VAX architecture, CD2.20:3, CD5.13:6   |
| problem, 517                          | Unconditional branches, 106          | Vectored interrupts, 386               |
| See also Translation-lookaside buffer | Underflow, 245                       | Vector processors, 650–53              |
| (TLB)                                 | Unicode                              | conventional code comparison,          |
| Tomasulo's algorithm, CD4.15:2        | alphabets, 126                       | 650–51                                 |
| Tournament branch predictors,         | defined, 126                         | instructions, 652                      |
| 383                                   | example alphabets, 127               | multimedia extensions and, 653         |
| Tracks, 575                           | Unified GPU architecture,            | scalar versus, 652                     |
| Transaction Processing Council        | A-10–12                              | See also Processors                    |
| (TPC), 596                            | illustrated, A-11                    | Verilog                                |
| Transaction processing (TP)           | processor array, A-11–12             | behavioral definition of MIPS          |
| defined, 596                          | Uniform memory access (UMA), 638–39, | ALU, C-25                              |
| I/O benchmarks, 596–97                | A-9                                  | behavioral definition with bypassing,  |
| Transfer time, 576                    | defined, 638                         | CD4.12:4–5                             |
| Transistors, 26                       | multiprocessors, 639                 | behavioral definition with stalls for  |
| Translation-lookaside buffer (TLB),   | Units                                | loads, CD4.12:6–7, CD4.12:8–9          |
| 502–4, CD5.13:5                       | commit, 399, 402                     | behavioral specification, C-21,        |
| associativities, 503                  | control, 303, 316–17, D-4–8, D-10,   | CD4.12:2–3                             |
| defined, 502                          | D-12–13                              | behavioral specification of multicycle |
| illustrated, 502                      | defined, 267                         | MIPS design, CD4.12:11–12              |
| integration, 504–8                    | floating point, 267                  | behavioral specification with simulation, |
| Intrinsity FastMATH, 504              | hazard detection, 372, 373           | CD4.12:1–5                             |
| MIPS-64, E-26–27                      | for load/store implementation, 311   | behavioral specification with stall    |
| typical values, 503                   | rank, 606, 607                       | detection, CD4.12:5–9                  |
| See also TLB misses                   | special function (SFUs), A-35,       | behavioral specification with synthesis, |
| Transmitter Control register,         | A-43, A-50                           | CD4.12:10–16                           |
| B-39–40                               | UNIVAC I, CD1.10:4                   | blocking assignment, C-24              |
| Transmitter Data register, B-40       | UNIX, CD2.20:7, CD5.13:8–11          | branch hazard logic implementation,    |
| Trap instructions, B-64–66            | AT&T, CD5.13:9                       | CD4.12:7–9                             |
| Tree-based parallel scan, A-62        | Berkeley version (BSD), CD5.13:9     | combinational logic, C-23–26           |
| Truth tables, C-5                     | genius, CD5.13:11                    | datatypes, C-21–22                     |
| ALU control lines, D-5                | history, CD5.13:8–11                 | defined, C-20                          |
| for control bits, 318                 | Unlock synchronization, 137          | forwarding implementation,             |
| datapath control outputs, D-17        | Unresolved references                | CD4.12:3                               |
| datapath control signals, D-14        | defined, B-4                         | MIPS ALU definition in, C-35–38        |
| defined, 317                          | linkers and, B-18                    | modules, C-23                          |
| example, C-5                          | Unsigned numbers, 87–94              | multicycle MIPS datapath, CD4.12:13    |
| next-state output bits, D-15          | Use latency                          | nonblocking assignment, C-24           |
| PLA implementation, C-13              | defined, 395                         | operators, C-22                        |
| Two-level logic, C-11–14              | one-instruction, 396                 | program structure, C-23                |
|                                       |                                      |                                        |

| Verilog (continued)                                             | segmentation, 495                     | load instruction, 350                 |
|-----------------------------------------------------------------|---------------------------------------|---------------------------------------|
| reg, C-21–22                                                    | summary, 516                          | store instruction, 352                |
| sensitivity list, C-24                                          | virtualization of, 529                | Write buffers                         |
| sequential logic specification,                                 | writes, 501                           | defined, 467                          |
| C-56–58                                                         | See also Pages                        | stalls, 476                           |
| structural specification, C-21                                  | Visual computing, A-3                 | write-back cache, 468                 |
| wire, C-21–22                                                   | Volatile memory, 21                   | Write invalidate protocols,           |
| Vertical microcode, D-32                                        |                                       | 536, 537                              |
| Very large-scale integrated (VLSI)                              | W                                     | Writes                                |
| circuits, 26                                                    |                                       | complications, 467                    |
| Very Long Instruction Word (VLIW)                               | Wafers, 46                            | expense, 516                          |
| defined, 393                                                    | defects, 46                           | handling, 466–68                      |
| first generation computers, CD4.15:4                            | defined, 45                           | memory hierarchy handling of,         |
| processors, 394                                                 | dies, 46                              | 521–22                                |
| VHDL, C-20–21                                                   | yield, 46                             | schemes, 467                          |
| Video graphics array (VGA) controllers,                         | Warps, 657, A-27                      | virtual memory, 501                   |
| A-3–4                                                           | Weak scaling, 637                     | write-back cache, 467, 468            |
| Virtual addresses                                               | Wear leveling, 581                    | write-through cache, 467, 468         |
| causing page faults, 514                                        | Web server benchmark                  | Write serialization, 535–36           |
| defined, 493                                                    | (SPECWeb), 597                        | Write-stall cycles, 476               |
| mapping from, 494                                               | While loops, 107–8                    | Write-through caches                  |
| size, 495                                                       | Whirlwind, CD5.13:1, CD5.13:3         | advantages, 522                       |
| Virtualizable hardware, 527                                     | Wide area networks (WANs), CD6.14:7–8 | defined, 467, 521                     |
| Virtually addressed caches, 508                                 | defined, 25                           | tag mismatch, 468                     |
| Virtual machine monitors (VMMs)                                 | history of, CD6.14:7–8                | See also Caches                       |
| defined, 526                                                    | See also Networks                     |                                       |
| implementing, 545–47                                            | Winchester disk, CD6.14:2–4           | X                                     |
| laissez-faire attitude, 546                                     | Wireless LANs, CD6.11:8–10            |                                       |
| page tables, 529                                                | Words                                 | X86, 165–74                           |
| in performance improvement, 528                                 | accessing, 82                         | brief history, CD2.20:5               |
| requirements, 527                                               | defined, 81                           | conclusion, 172                       |
| Virtual machines (VMs), 525–29                                  | double, 168                           | data addressing modes, 168, 170       |
| benefits, 526                                                   | load, 83, 85                          | evolution, 165–68                     |
| defined, B-41                                                   | quad, 168                             | first address specifier encoding, 174 |
| illusion, 529                                                   | store, 85                             | floating point, 272–74                |
| instruction-set architecture support,                           | Working set, 517                      | floating-point instructions, 273      |
| 527–28                                                          | Worst-case delay, 330                 | historical timeline, 166–67           |
| performance improvement, 528                                    | Write-back caches                     | instruction encoding, 171–72          |
| for protection improvement, 526                                 | advantages, 522                       | instruction formats, 173              |
| simulation of, B-41–42                                          | cache coherency protocol,             | instruction set growth, 176           |
| Virtual memory, 492–517                                         | CD5.9:12                              | instruction types, 169                |
| address translation, 493, 502–4                                 | complexity, 468                       | integer operations, 168–71            |
| defined, 492                                                    | defined, 467, 521                     | I/O interconnects, 584–86             |
| integration, 504–8                                              | stalls, 476                           | registers, 168                        |
| mechanism, 516                                                  | write buffers, 468                    | SIMD in, 649–50                       |
| motivations, 492–93                                             | See also Caches                       | typical instructions/functions, 171   |
| page faults, 493, 498                                           | Write-back stage                      | typical operations, 172               |
| protection implementation, 508–10                               | control line, 362                     | Xerox Alto computer, CD1.10:7–8       |